doc: New chapter "Strings and Characters".

* doc/strings.texi: New file.
* doc/gnulib.texi (POSIXURL): New variable.
(posixheader, posixfunc, func): New macros, from GNU libunistring's documentation.
Include strings.texi.
(Particular Modules): Don't include c-locale.texi here.
* doc/c-locale.texi: Sections become subsections, subsections become subsubsections.
* doc/posix-functions/isalnum.texi: Mention c32isalnum.
* doc/posix-functions/isalpha.texi: Mention c32isalpha.
* doc/posix-functions/isblank.texi: Mention c32isblank.
* doc/posix-functions/iscntrl.texi: Mention c32iscntrl.
* doc/posix-functions/isdigit.texi: Mention c32isdigit.
* doc/posix-functions/isgraph.texi: Mention c32isgraph.
* doc/posix-functions/islower.texi: Mention c32islower.
* doc/posix-functions/isprint.texi: Mention c32isprint.
* doc/posix-functions/ispunct.texi: Mention c32ispunct.
* doc/posix-functions/isspace.texi: Mention c32isspace.
* doc/posix-functions/isupper.texi: Mention c32isupper.
* doc/posix-functions/isxdigit.texi: Mention c32isxdigit.
* doc/posix-functions/tolower.texi: Mention alternative APIs.
* doc/posix-functions/toupper.texi: Likewise.
* doc/posix-functions/towlower.texi: Mention c32tolower.
* doc/posix-functions/towupper.texi: Mention c32toupper.
* doc/posix-functions/wcswidth.texi: Mention c32swidth.
* doc/posix-functions/wcwidth.texi: Mention c32width.
author: Bruno Haible <bruno@clisp.org> 2023-05-16 02:02:13 +0200
committer: Bruno Haible <bruno@clisp.org> 2023-05-16 02:02:13 +0200
commit: 8c4d0fbf4c45df8e86acbb338b154930c5498dc3
tree: 5401c85ea991c90093b1b2ab1d0afa6f0d0d6c60
parent: f3059fb80504e1f4e1e4c34562508f404ce7b0f7
22 files changed, 1086 insertions, 29 deletions
diff --git a/ChangeLog b/ChangeLog
index b894505c2b..ecbc25ef06 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,35 @@
 2023-05-15  Bruno Haible  <bruno@clisp.org>
 
+	doc: New chapter "Strings and Characters".
+	* doc/strings.texi: New file.
+	* doc/gnulib.texi (POSIXURL): New variable.
+	(posixheader, posixfunc, func): New macros, from GNU libunistring's
+	documentation.
+	Include strings.texi.
+	(Particular Modules): Don't include c-locale.texi here.
+	* doc/c-locale.texi: Sections become subsections, subsections become
+	subsubsections.
+	* doc/posix-functions/isalnum.texi: Mention c32isalnum.
+	* doc/posix-functions/isalpha.texi: Mention c32isalpha.
+	* doc/posix-functions/isblank.texi: Mention c32isblank.
+	* doc/posix-functions/iscntrl.texi: Mention c32iscntrl.
+	* doc/posix-functions/isdigit.texi: Mention c32isdigit.
+	* doc/posix-functions/isgraph.texi: Mention c32isgraph.
+	* doc/posix-functions/islower.texi: Mention c32islower.
+	* doc/posix-functions/isprint.texi: Mention c32isprint.
+	* doc/posix-functions/ispunct.texi: Mention c32ispunct.
+	* doc/posix-functions/isspace.texi: Mention c32isspace.
+	* doc/posix-functions/isupper.texi: Mention c32isupper.
+	* doc/posix-functions/isxdigit.texi: Mention c32isxdigit.
+	* doc/posix-functions/tolower.texi: Mention alternative APIs.
+	* doc/posix-functions/toupper.texi: Likewise.
+	* doc/posix-functions/towlower.texi: Mention c32tolower.
+	* doc/posix-functions/towupper.texi: Mention c32toupper.
+	* doc/posix-functions/wcswidth.texi: Mention c32swidth.
+	* doc/posix-functions/wcwidth.texi: Mention c32width.
+
+2023-05-15  Bruno Haible  <bruno@clisp.org>
+
 	sigsegv: Add tentative support for Hurd/x86_64.
 	Based on explanations by Sergey Bugaev <bugaevc@gmail.com>.
 	* lib/sigsegv.c: Update from libsigsegv/src/fault-hurd-i386-old.h.
diff --git a/doc/c-locale.texi b/doc/c-locale.texi
index 63d11384bd..b9f6274873 100644
--- a/doc/c-locale.texi
+++ b/doc/c-locale.texi
@@ -1,5 +1,5 @@
 @node String Functions in C Locale
-@section Character and String Functions in C Locale
+@subsection Character and String Functions in C Locale
 
 The functions in this section are similar to the generic string functions
 from the standard C library, except that
@@ -12,6 +12,8 @@ They are specially optimized for the case where all characters are plain
 ASCII characters.
 @end itemize
 
+The functions are provided by the following modules.
+
 @menu
 * c-ctype::
 * c-strcase::
@@ -23,29 +25,29 @@ ASCII characters.
 @end menu
 
 @node c-ctype
-@subsection c-ctype
+@subsubsection c-ctype
 @include c-ctype.texi
 
 @node c-strcase
-@subsection c-strcase
+@subsubsection c-strcase
 @include c-strcase.texi
 
 @node c-strcaseeq
-@subsection c-strcaseeq
+@subsubsection c-strcaseeq
 @include c-strcaseeq.texi
 
 @node c-strcasestr
-@subsection c-strcasestr
+@subsubsection c-strcasestr
 @include c-strcasestr.texi
 
 @node c-strstr
-@subsection c-strstr
+@subsubsection c-strstr
 @include c-strstr.texi
 
 @node c-strtod
-@subsection c-strtod
+@subsubsection c-strtod
 @include c-strtod.texi
 
 @node c-strtold
-@subsection c-strtold
+@subsubsection c-strtold
 @include c-strtold.texi
diff --git a/doc/gnulib.texi b/doc/gnulib.texi
index 3af5cb21b2..0f91de5a39 100644
--- a/doc/gnulib.texi
+++ b/doc/gnulib.texi
@@ -82,6 +82,7 @@ Documentation License''.
 * Glibc Function Substitutes::      Replacing system functions.
 * Native Windows Support::          Support for the native Windows platforms.
 * Multithreading::                  Multiple threads of execution.
+* Strings and Characters::          Functions for strings and characters.
 * Particular Modules::              Documentation of individual modules.
 * Regular expressions::             The regex module.
 * Build Infrastructure Modules::    Modules that extend the GNU Build System.
@@ -91,6 +92,42 @@ Documentation License''.
 * Index::
 @end menu
 
+@c Location of the POSIX specification on the web.
+@set POSIXURL http://pubs.opengroup.org/onlinepubs/9699919799
+
+@c Macro for referencing a POSIX header.
+@ifinfo
+@macro posixheader{header}
+@code{<\header\>}
+@end macro
+@end ifinfo
+@ifnotinfo
+@macro posixheader{header}
+@uref{@value{POSIXURL}/basedefs/\header\.html,,@code{<\header\>}}
+@end macro
+@end ifnotinfo
+
+@c Macro for referencing a POSIX function.
+@c We don't write it as func(), see section "GNU Manuals" of the
+@c GNU coding standards.
+@ifinfo
+@macro posixfunc{func}
+@code{\func\}
+@end macro
+@end ifinfo
+@ifnotinfo
+@macro posixfunc{func}
+@uref{@value{POSIXURL}/functions/\func\.html,,@code{\func\}}
+@end macro
+@end ifnotinfo
+
+@c Macro for referencing a normal function.
+@c We don't write it as func(), see section "GNU Manuals" of the
+@c GNU coding standards.
+@macro func{func}
+@code{\func\}
+@end macro
+
 @c This is used at the beginning of four chapters.
 @macro nosuchmodulenote{thing}
 The notation ``Gnulib module: ---'' means that Gnulib does not provide a
@@ -6896,6 +6933,9 @@ to POSIX that it can be treated like any other Unix-like platform.
 @include multithread.texi
 
 
+@include strings.texi
+
+
 @node Particular Modules
 @chapter Particular Modules
 
@@ -6912,7 +6952,6 @@ to POSIX that it can be treated like any other Unix-like platform.
 * Closed standard fds::
 * Handling strings with NUL characters::
 * Container data types::
-* String Functions in C Locale::
 * Recognizing Option Arguments::
 * Quoting::
 * progname and getprogname::
@@ -6954,8 +6993,6 @@ to POSIX that it can be treated like any other Unix-like platform.
 
 @include containers.texi
 
-@include c-locale.texi
-
 @include argmatch.texi
 
 @include quote.texi
diff --git a/doc/posix-functions/isalnum.texi b/doc/posix-functions/isalnum.texi
index b538d199c1..422b55d193 100644
--- a/doc/posix-functions/isalnum.texi
+++ b/doc/posix-functions/isalnum.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_isalnum
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32isalnum
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32isalnum}.
+
 @item mb_isalnum
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/isalpha.texi b/doc/posix-functions/isalpha.texi
index 2e4304ea8b..ee1c644a42 100644
--- a/doc/posix-functions/isalpha.texi
+++ b/doc/posix-functions/isalpha.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_isalpha
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32isalpha
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32isalpha}.
+
 @item mb_isalpha
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/isblank.texi b/doc/posix-functions/isblank.texi
index ab23391ac6..18b09fd903 100644
--- a/doc/posix-functions/isblank.texi
+++ b/doc/posix-functions/isblank.texi
@@ -24,7 +24,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_isblank
@@ -37,6 +37,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32isblank
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32isblank}.
+
 @item mb_isblank
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/iscntrl.texi b/doc/posix-functions/iscntrl.texi
index 19758ef1da..c6a40314f6 100644
--- a/doc/posix-functions/iscntrl.texi
+++ b/doc/posix-functions/iscntrl.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_iscntrl
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32iscntrl
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32iscntrl}.
+
 @item mb_iscntrl
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/isdigit.texi b/doc/posix-functions/isdigit.texi
index 2494170827..7d01a500b2 100644
--- a/doc/posix-functions/isdigit.texi
+++ b/doc/posix-functions/isdigit.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_isdigit
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32isdigit
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32isdigit}.
+
 @item mb_isdigit
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/isgraph.texi b/doc/posix-functions/isgraph.texi
index 01fbc83cb7..a4754dda3b 100644
--- a/doc/posix-functions/isgraph.texi
+++ b/doc/posix-functions/isgraph.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_isgraph
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32isgraph
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32isgraph}.
+
 @item mb_isgraph
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/islower.texi b/doc/posix-functions/islower.texi
index 8eba57ae5a..fb3a898ab7 100644
--- a/doc/posix-functions/islower.texi
+++ b/doc/posix-functions/islower.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_islower
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32islower
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32islower}.
+
 @item mb_islower
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/isprint.texi b/doc/posix-functions/isprint.texi
index e30ddc958d..931776f34c 100644
--- a/doc/posix-functions/isprint.texi
+++ b/doc/posix-functions/isprint.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_isprint
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32isprint
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32isprint}.
+
 @item mb_isprint
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/ispunct.texi b/doc/posix-functions/ispunct.texi
index 2f245202a2..252b5773f5 100644
--- a/doc/posix-functions/ispunct.texi
+++ b/doc/posix-functions/ispunct.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_ispunct
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32ispunct
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32ispunct}.
+
 @item mb_ispunct
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/isspace.texi b/doc/posix-functions/isspace.texi
index c0817ca768..ab7b0b41d8 100644
--- a/doc/posix-functions/isspace.texi
+++ b/doc/posix-functions/isspace.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_isspace
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32isspace
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32isspace}.
+
 @item mb_isspace
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/isupper.texi b/doc/posix-functions/isupper.texi
index 295e86cf4d..eb2b0ff0b6 100644
--- a/doc/posix-functions/isupper.texi
+++ b/doc/posix-functions/isupper.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_isupper
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32isupper
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32isupper}.
+
 @item mb_isupper
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/isxdigit.texi b/doc/posix-functions/isxdigit.texi
index e5b78bceaa..5e13b008c4 100644
--- a/doc/posix-functions/isxdigit.texi
+++ b/doc/posix-functions/isxdigit.texi
@@ -21,7 +21,7 @@ Portability problems not fixed by Gnulib:
 Note: This function's behaviour depends on the locale, but does not support
 the multibyte characters that occur in strings in locales with
 @code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
-There are four alternative APIs:
+There are five alternative APIs:
 
 @table @code
 @item c_isxdigit
@@ -34,6 +34,12 @@ order to use it, you first have to convert from multibyte to wide characters,
 using the @code{mbrtowc} function.  It is provided by the Gnulib module
 @samp{wctype}.
 
+@item c32isxdigit
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32isxdigit}.
+
 @item mb_isxdigit
 This function operates in a locale dependent way, on multibyte characters.
 It is provided by the Gnulib module @samp{mbchar}.
diff --git a/doc/posix-functions/tolower.texi b/doc/posix-functions/tolower.texi
index d8d2bf7f06..9911b7fb0c 100644
--- a/doc/posix-functions/tolower.texi
+++ b/doc/posix-functions/tolower.texi
@@ -16,7 +16,32 @@ OS X 10.8.
 
 Portability problems not fixed by Gnulib:
 @itemize
-@item
-On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and therefore cannot
-accommodate all Unicode characters.
 @end itemize
+
+Note: This function's behaviour depends on the locale, but does not support
+the multibyte characters that occur in strings in locales with
+@code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
+There are four alternative APIs:
+
+@table @code
+@item c_tolower
+This function operates in a locale independent way and returns a different
+value than the argument only for uppercase ASCII characters.  It is provided
+by the Gnulib module @samp{c-ctype}.
+
+@item towlower
+This function operates in a locale dependent way, on wide characters.  In
+order to use it, you first have to convert from multibyte to wide characters,
+using the @code{mbrtowc} function.  It is provided by the Gnulib module
+@samp{wctype}.
+
+@item c32tolower
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32tolower}.
+
+@item uc_tolower
+This function operates in a locale independent way, on Unicode characters.
+It is provided by the Gnulib module @samp{unicase/tolower}.
+@end table
diff --git a/doc/posix-functions/toupper.texi b/doc/posix-functions/toupper.texi
index 36e40c45bc..86272d3ca0 100644
--- a/doc/posix-functions/toupper.texi
+++ b/doc/posix-functions/toupper.texi
@@ -16,7 +16,32 @@ OS X 10.8.
 
 Portability problems not fixed by Gnulib:
 @itemize
-@item
-On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and therefore cannot
-accommodate all Unicode characters.
 @end itemize
+
+Note: This function's behaviour depends on the locale, but does not support
+the multibyte characters that occur in strings in locales with
+@code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales).
+There are four alternative APIs:
+
+@table @code
+@item c_toupper
+This function operates in a locale independent way and returns a different
+value than the argument only for lowercase ASCII characters.  It is provided
+by the Gnulib module @samp{c-ctype}.
+
+@item towupper
+This function operates in a locale dependent way, on wide characters.  In
+order to use it, you first have to convert from multibyte to wide characters,
+using the @code{mbrtowc} function.  It is provided by the Gnulib module
+@samp{wctype}.
+
+@item c32toupper
+This function operates in a locale dependent way, on 32-bit wide characters.
+In order to use it, you first have to convert from multibyte to 32-bit wide
+characters, using the @code{mbrtoc32} function.  It is provided by the
+Gnulib module @samp{c32toupper}.
+
+@item uc_toupper
+This function operates in a locale independent way, on Unicode characters.
+It is provided by the Gnulib module @samp{unicase/toupper}.
+@end table
diff --git a/doc/posix-functions/towlower.texi b/doc/posix-functions/towlower.texi
index a8ef3ce990..b6c51a4571 100644
--- a/doc/posix-functions/towlower.texi
+++ b/doc/posix-functions/towlower.texi
@@ -23,6 +23,9 @@ Portability problems not fixed by Gnulib:
 @item
 On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and therefore cannot
 accommodate all Unicode characters.
+However, the Gnulib function @code{c32tolower}, provided by Gnulib module
+@code{c32tolower}, operates on 32-bit wide characters and therefore does not
+have this limitation.
 @item
 This function returns wrong values even for the ASCII characters
 in a zh_CN.GB18030 locale on some platforms:
diff --git a/doc/posix-functions/towupper.texi b/doc/posix-functions/towupper.texi
index 902cf16e68..bd29eec9ad 100644
--- a/doc/posix-functions/towupper.texi
+++ b/doc/posix-functions/towupper.texi
@@ -23,6 +23,9 @@ Portability problems not fixed by Gnulib:
 @item
 On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and therefore cannot
 accommodate all Unicode characters.
+However, the Gnulib function @code{c32toupper}, provided by Gnulib module
+@code{c32toupper}, operates on 32-bit wide characters and therefore does not
+have this limitation.
 @item
 This function returns wrong values even for the ASCII characters
 in a zh_CN.GB18030 locale on some platforms:
diff --git a/doc/posix-functions/wcswidth.texi b/doc/posix-functions/wcswidth.texi
index eab3393a60..02049ffdd0 100644
--- a/doc/posix-functions/wcswidth.texi
+++ b/doc/posix-functions/wcswidth.texi
@@ -18,4 +18,7 @@ Portability problems not fixed by Gnulib:
 @item
 On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and therefore cannot
 accommodate all Unicode characters.
+However, the Gnulib function @code{c32swidth}, provided by Gnulib module
+@code{c32swidth}, operates on 32-bit wide characters and therefore does not
+have this limitation.
 @end itemize
diff --git a/doc/posix-functions/wcwidth.texi b/doc/posix-functions/wcwidth.texi
index b68cb78015..1ec5f48197 100644
--- a/doc/posix-functions/wcwidth.texi
+++ b/doc/posix-functions/wcwidth.texi
@@ -29,6 +29,9 @@ Portability problems not fixed by Gnulib:
 @item
 On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and therefore cannot
 accommodate all Unicode characters.
+However, the Gnulib function @code{c32width}, provided by Gnulib module
+@code{c32width}, operates on 32-bit wide characters and therefore does not
+have this limitation.
 @item
 This function treats zero-width spaces like control characters on some
 platforms:
diff --git a/doc/strings.texi b/doc/strings.texi
new file mode 100644
index 0000000000..aa0830f1a5
--- /dev/null
+++ b/doc/strings.texi
@@ -0,0 +1,854 @@
+@node Strings and Characters
+@chapter Strings and Characters
+
+@c Copyright (C) 2009-2023 Free Software Foundation, Inc.
+
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3 or
+@c any later version published by the Free Software Foundation; with no
+@c Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A
+@c copy of the license is at <https://www.gnu.org/licenses/fdl-1.3.en.html>.
+
+@c Written by Bruno Haible.
+
+This chapter describes the APIs for strings and characters, provided by Gnulib.
+
+@menu
+* Strings::
+* Characters::
+@end menu
+
+@node Strings
+@section Strings
+
+Several representations are possible for strings in the memory of a
+running C program.
+
+@menu
+* C strings::
+* Strings with NUL characters::
+* Comparison of string APIs::
+@end menu
+
+@node C strings
+@subsection The C string representation
+
+The classical representation of a string in C is a sequence of
+characters, where each character takes up one or more bytes, followed by
+a terminating NUL byte.  This representation is used for strings that
+are passed by the operating system (in the @code{argv} argument of
+@code{main}, for example) and for strings that are passed to the
+operating system (in system calls such as @code{open}).  The C type to
+hold such strings is @samp{char *} or, in places where the string shall
+not be modified, @samp{const char *}.  There are many C library
+functions, standardized by ISO C and POSIX, that assume this
+representation of strings.
+
+A @emph{character encoding}, or @emph{encoding} for short, describes
+how the elements of a character set are represented as a sequence of
+bytes.  For example, in the @code{ASCII} encoding, the UNDERSCORE
+character is represented by a single byte, with value 0x5F.  As another
+example, the COPYRIGHT SIGN character is represented:
+@itemize
+@item
+in the @code{ISO-8859-1} encoding, by the single byte 0xA9,
+@item
+in the @code{UTF-8} encoding, by the two bytes 0xC2 0xA9,
+@item
+in the @code{GB18030} encoding, by the four bytes 0x81 0x30 0x84 0x38.
+@end itemize
+
+@noindent
+Note: The @samp{char} type may be signed or unsigned, depending on the
+platform.  When we talk about the "byte 0xA9" we actually mean the
+@code{char} object whose value is @code{(char) 0xA9}; we omit the cast
+to @code{char} in this documentation, for brevity.
+
+In POSIX, the character encoding is determined by the locale.  The
+locale is an attribute of the environment that the user can choose.
+
+Depending on the encoding, in general, every character is represented by
+one or more bytes (up to 4 bytes in practice --- but
+use @code{MB_LEN_MAX} instead of the number 4 in the code).
+@cindex unibyte locale
+@cindex multibyte locale
+When every character is represented by only 1 byte, we speak of a
+``unibyte locale'', otherwise of a ``multibyte locale''.
+
+It is important to realize that the majority of Unix installations
+nowadays use UTF-8 or GB18030 as locale encoding; therefore, the
+majority of users are using multibyte locales.
+
+Three important facts to remember are:
+
+@cartouche
+@emph{A @samp{char} is a byte, not a character.}
+@end cartouche
+
+As a consequence:
+@itemize @bullet
+@item
+The @posixheader{ctype.h} API, which was designed only with unibyte
+encodings in mind, is useless nowadays: it does not work in
+multibyte locales.
+@item
+The @posixfunc{strlen} function does not return the number of characters
+in a string.  Nor does it return the number of screen columns occupied
+by a string after it is output.  It merely returns the number of
+@emph{bytes} occupied by a string.
+@item
+Truncating a string, for example, with @posixfunc{strncpy}, can have the
+effect of truncating it in the middle of a multibyte character.  Such
+a string will, when output, have a garbled character at its end, often
+represented by a hollow box.
+@end itemize
+
+@cartouche
+@emph{Multibyte does not imply UTF-8 encoding.}
+@end cartouche
+
+While UTF-8 is the most common multibyte encoding, GB18030 is also in
+use and will not go away for decades, because it is a Chinese
+government standard, last revised in 2022.
+
+@cartouche
+@emph{Searching for a character in a string is not the same as searching
+for a byte in the string.}
+@end cartouche
+
+Take the above example of COPYRIGHT SIGN in the @code{GB18030} encoding:
+A byte search will find the bytes @code{'0'} and @code{'8'} in this
+string.  But a search for the @emph{character} "0" or "8" in the string
+"@copyright{}" must, of course, report ``not found''.
+
+As a consequence:
+@itemize @bullet
+@item
+@posixfunc{strchr} and @posixfunc{strrchr} do not work with multibyte
+strings if the locale encoding is GB18030 and the character to be
+searched is a digit.
+@item
+@posixfunc{strstr} does not work with multibyte strings if the locale
+encoding is different from UTF-8.
+@item
+@posixfunc{strcspn}, @posixfunc{strpbrk}, @posixfunc{strspn} cannot work
+correctly in multibyte locales: they assume the second argument is a
+list of single-byte characters.  Even in this simple case, they do not
+work with multibyte strings if the locale encoding is GB18030 and one of
+the characters to be searched is a digit.
+@item
+@posixfunc{strsep} and @posixfunc{strtok_r} do not work with multibyte
+strings unless all of the delimiter characters are ASCII characters
+< 0x30.
+@item
+The @posixfunc{strcasecmp}, @posixfunc{strncasecmp}, and
+@posixfunc{strcasestr} functions do not work with multibyte strings.
+@end itemize
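+
+The GB18030 pitfall can be reproduced without a GB18030 locale
+being installed.  The sketch below hard-codes the GB18030 byte layout
+(1-byte, 2-byte and 4-byte characters) in a hypothetical helper,
+@code{gb18030_char_length}, and uses it for a character-wise search,
+@code{gb18030_find_ascii}.  Both names are invented for this
+illustration and are not part of Gnulib; real code should rather call
+@func{mbschr}, which works in every locale encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the length in bytes (1, 2 or 4) of the
   GB18030 character starting at P, or 0 if the bytes are malformed.
   P must point into a NUL-terminated string.  */
static size_t
gb18030_char_length (const unsigned char *p)
{
  if (p[0] < 0x80)
    return 1;                               /* ASCII */
  if (p[0] == 0x80 || p[0] == 0xFF)
    return 0;                               /* not a valid lead byte */
  if (p[1] >= 0x30 && p[1] <= 0x39)         /* 4-byte form */
    return (p[2] >= 0x81 && p[2] <= 0xFE
            && p[3] >= 0x30 && p[3] <= 0x39) ? 4 : 0;
  if (p[1] >= 0x40 && p[1] <= 0xFE && p[1] != 0x7F)
    return 2;                               /* 2-byte form */
  return 0;
}

/* Search the GB18030 string S for the ASCII character C, looking only
   at whole characters.  Return a pointer to it, or NULL.  */
const char *
gb18030_find_ascii (const char *s, char c)
{
  assert (s != NULL && c != '\0' && (unsigned char) c < 0x80);
  const unsigned char *p = (const unsigned char *) s;
  while (*p != '\0')
    {
      size_t n = gb18030_char_length (p);
      if (n == 0)
        n = 1;                              /* skip one malformed byte */
      if (n == 1 && *p == (unsigned char) c)
        return (const char *) p;
      p += n;
    }
  return NULL;
}
```

+With the COPYRIGHT SIGN stored as the four bytes
+@code{"\x81\x30\x84\x38"}, @code{strchr (s, '0')} returns a pointer into
+the middle of the character, whereas @code{gb18030_find_ascii (s, '0')}
+returns @code{NULL}.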
+
+Workarounds can be found in Gnulib, in the form of @code{mbs*} API
+functions:
+@itemize @bullet
+@item
+Gnulib has functions @func{mbslen} and @func{mbswidth} that can be used
+instead of @posixfunc{strlen} when the number of characters or the
+number of screen columns of a string is requested.
+@item
+Gnulib has functions @func{mbschr} and @func{mbsrchr} that are like
+@posixfunc{strchr} and @posixfunc{strrchr}, but work in multibyte
+locales.
+@item
+Gnulib has a function @func{mbsstr} that is like @posixfunc{strstr}, but
+works in multibyte locales.
+@item
+Gnulib has functions @func{mbscspn}, @func{mbspbrk}, @func{mbsspn} that
+are like @posixfunc{strcspn}, @posixfunc{strpbrk}, @posixfunc{strspn},
+but work in multibyte locales.
+@item
+Gnulib has functions @func{mbssep} and @func{mbstok_r} that are like
+@posixfunc{strsep} and @posixfunc{strtok_r} but work in multibyte
+locales.
+@item
+Gnulib has functions @func{mbscasecmp}, @func{mbsncasecmp},
+@func{mbspcasecmp}, and @func{mbscasestr} that are like
+@posixfunc{strcasecmp}, @posixfunc{strncasecmp}, and
+@posixfunc{strcasestr}, but work in multibyte locales.  Still, the
+function @code{ulc_casecmp} is preferable to these functions.
+@end itemize
+
+Gnulib also has additional API.
+
+@menu
+* Iterating through strings::
+@end menu
+
+@node Iterating through strings
+@subsubsection Iterating through strings
+
+For complex string processing, the provided string functions may not be
+enough, and what you need is a way to iterate through a string while
+processing each (possibly multibyte) character in turn.  Gnulib provides
+two modules for this purpose.  Both iterate through the string in
+forward direction.  Iteration in backward direction, that is, from the
+string's end to start, is not provided, as it is too hairy in general.
+
+@itemize
+@item
+The @code{mbiter} module.  It iterates through a C string whose length
+is already known.
+@item
+The @code{mbuiter} module.  It iterates through a C string whose length
+is not known a priori.
+@end itemize
+
+The @code{mbuiter} module is suitable when there is a high probability
+that only the first few multibyte characters need to be inspected,
+whereas the @code{mbiter} module is better if the iteration usually runs
+through the entire string.
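+
+The idea behind iterating without knowing the length in advance can be
+sketched in plain ISO C.  This is not the @code{mbuiter} API; the
+function name @code{leading_characters_bytes} and the helper
+@code{bounded_length} are hypothetical.  The sketch decodes with
+@code{mbrlen} in the current locale and never calls @code{strlen}, so
+it touches only the bytes of the characters it actually inspects (plus
+at most @code{MB_CUR_MAX} bytes of lookahead that stop at the NUL).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>   /* memset */
#include <wchar.h>    /* mbrlen, mbstate_t */

/* Return the number of non-NUL bytes at P, looking at no more than MAX
   bytes and never reading past the terminating NUL.  */
static size_t
bounded_length (const char *p, size_t max)
{
  size_t n = 0;
  while (n < max && p[n] != '\0')
    n++;
  return n;
}

/* Return the number of bytes occupied by the first N_CHARS characters
   of S (or by all of S, if it is shorter).  Stops early at a malformed
   or incomplete character.  */
size_t
leading_characters_bytes (const char *s, size_t n_chars)
{
  assert (s != NULL);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t offset = 0;
  while (n_chars > 0 && s[offset] != '\0')
    {
      size_t avail = bounded_length (s + offset, MB_CUR_MAX);
      size_t n = mbrlen (s + offset, avail, &state);
      if (n == (size_t) -1 || n == (size_t) -2 || n == 0)
        break;
      offset += n;
      n_chars--;
    }
  return offset;
}
```

+When the whole string has to be traversed anyway, computing its length
+once up front and iterating with a known end pointer avoids the repeated
+lookahead.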
+
+@node Strings with NUL characters
+@subsection Strings with NUL characters
+
+The GNU Coding Standards, section
+@ifinfo
+@ref{Semantics,,Writing Robust Programs,standards},
+@end ifinfo
+@ifnotinfo
+@url{https://www.gnu.org/prep/standards/html_node/Semantics.html},
+@end ifnotinfo
+specifies:
+@cartouche
+Utilities reading files should not drop NUL characters, or any other
+nonprinting characters.
+@end cartouche
+
+When it is a requirement to store NUL characters in strings, a variant
+of the C strings is needed.  Gnulib offers a ``string descriptor'' type
+for this purpose.  See @ref{Handling strings with NUL characters}.
+
+All remarks regarding encodings and multibyte characters in the previous
+section apply to string descriptors as well.
+
+@include c-locale.texi
+
+@node Comparison of string APIs
+@subsection Comparison of string APIs
+
+This table summarizes the API functions available for strings, in POSIX
+and in Gnulib.
+
+@multitable @columnfractions .17 .17 .17 .17 .16 .16
+@headitem unibyte strings only
+@tab assume C locale
+@tab multibyte strings
+@tab multibyte strings with NULs
+@tab wide character strings
+@tab 32-bit wide character strings
+
+@item @code{strlen}
+@tab @code{strlen}
+@tab @code{mbslen}
+@tab @code{string_desc_length}
+@tab @code{wcslen}
+@tab @code{u32_strlen}
+
+@item @code{strnlen}
+@tab @code{strnlen}
+@tab @code{mbsnlen}
+@tab --
+@tab @code{wcsnlen}
+@tab @code{u32_strnlen}, @code{u32_mbsnlen}
+
+@item @code{strcmp}
+@tab @code{strcmp}
+@tab @code{strcmp}
+@tab @code{string_desc_cmp}
+@tab @code{wcscmp}
+@tab @code{u32_strcmp}
+
+@item @code{strncmp}
+@tab @code{strncmp}
+@tab @code{strncmp}
+@tab --
+@tab @code{wcsncmp}
+@tab @code{u32_strncmp}
+
+@item @code{strcasecmp}
+@tab @code{strcasecmp}
+@tab @code{mbscasecmp}
+@tab --
+@tab @code{wcscasecmp}
+@tab @code{u32_casecmp}
+
+@item @code{strncasecmp}
+@tab @code{strncasecmp}
+@tab @code{mbsncasecmp}, @code{mbspcasecmp}
+@tab --
+@tab @code{wcsncasecmp}
+@tab @code{u32_casecmp}
+
+@item @code{strcoll}
+@tab @code{strcmp}
+@tab @code{strcoll}
+@tab --
+@tab @code{wcscoll}
+@tab @code{u32_strcoll}
+
+@item @code{strxfrm}
+@tab --
+@tab @code{strxfrm}
+@tab --
+@tab @code{wcsxfrm}
+@tab --
+
+@item @code{strchr}
+@tab @code{strchr}
+@tab @code{mbschr}
+@tab @code{string_desc_index}
+@tab @code{wcschr}
+@tab @code{u32_strchr}
+
+@item @code{strrchr}
+@tab @code{strrchr}
+@tab @code{mbsrchr}
+@tab @code{string_desc_last_index}
+@tab @code{wcsrchr}
+@tab @code{u32_strrchr}
+
+@item @code{strstr}
+@tab @code{strstr}
+@tab @code{mbsstr}
+@tab @code{string_desc_contains}
+@tab @code{wcsstr}
+@tab @code{u32_strstr}
+
+@item @code{strcasestr}
+@tab @code{strcasestr}
+@tab @code{mbscasestr}
+@tab --
+@tab --
+@tab --
+
+@item @code{strspn}
+@tab @code{strspn}
+@tab @code{mbsspn}
+@tab --
+@tab @code{wcsspn}
+@tab @code{u32_strspn}
+
+@item @code{strcspn}
+@tab @code{strcspn}
+@tab @code{mbscspn}
+@tab --
+@tab @code{wcscspn}
+@tab @code{u32_strcspn}
+
+@item @code{strpbrk}
+@tab @code{strpbrk}
+@tab @code{mbspbrk}
+@tab --
+@tab @code{wcspbrk}
+@tab @code{u32_strpbrk}
+
+@item @code{strtok_r}
+@tab @code{strtok_r}
+@tab @code{mbstok_r}
+@tab --
+@tab @code{wcstok}
+@tab @code{u32_strtok}
+
+@item @code{strsep}
+@tab @code{strsep}
+@tab @code{mbssep}
+@tab --
+@tab --
+@tab --
+
+@item @code{strcpy}
+@tab @code{strcpy}
+@tab @code{strcpy}
+@tab @code{string_desc_copy}
+@tab @code{wcscpy}
+@tab @code{u32_strcpy}
+
+@item @code{stpcpy}
+@tab @code{stpcpy}
+@tab @code{stpcpy}
+@tab --
+@tab @code{wcpcpy}
+@tab @code{u32_stpcpy}
+
+@item @code{strncpy}
+@tab @code{strncpy}
+@tab @code{strncpy}
+@tab --
+@tab @code{wcsncpy}
+@tab @code{u32_strncpy}
+
+@item @code{stpncpy}
+@tab @code{stpncpy}
+@tab @code{stpncpy}
+@tab --
+@tab @code{wcpncpy}
+@tab @code{u32_stpncpy}
+
+@item @code{strcat}
+@tab @code{strcat}
+@tab @code{strcat}
+@tab @code{string_desc_concat}
+@tab @code{wcscat}
+@tab @code{u32_strcat}
+
+@item @code{strncat}
+@tab @code{strncat}
+@tab @code{strncat}
+@tab --
+@tab @code{wcsncat}
+@tab @code{u32_strncat}
+
+@item @code{free}
+@tab @code{free}
+@tab @code{free}
+@tab @code{string_desc_free}
+@tab @code{free}
+@tab @code{free}
+
+@item @code{strdup}
+@tab @code{strdup}
+@tab @code{strdup}
+@tab @code{string_desc_copy}
+@tab @code{wcsdup}
+@tab @code{u32_strdup}
+
+@item @code{strndup}
+@tab @code{strndup}
+@tab @code{strndup}
+@tab --
+@tab --
+@tab --
+
+@item @code{mbswidth}
+@tab @code{mbswidth}
+@tab @code{mbswidth}
+@tab --
+@tab @code{wcswidth}
+@tab @code{c32swidth}, @code{u32_strwidth}
+
+@item @code{strtol}
+@tab @code{strtol}
+@tab @code{strtol}
+@tab --
+@tab --
+@tab --
+
+@item @code{strtoul}
+@tab @code{strtoul}
+@tab @code{strtoul}
+@tab --
+@tab --
+@tab --
+
+@item @code{strtoll}
+@tab @code{strtoll}
+@tab @code{strtoll}
+@tab --
+@tab --
+@tab --
+
+@item @code{strtoull}
+@tab @code{strtoull}
+@tab @code{strtoull}
+@tab --
+@tab --
+@tab --
+
+@item @code{strtoimax}
+@tab @code{strtoimax}
+@tab @code{strtoimax}
+@tab --
+@tab @code{wcstoimax}
+@tab --
+
+@item @code{strtoumax}
+@tab @code{strtoumax}
+@tab @code{strtoumax}
+@tab --
+@tab @code{wcstoumax}
+@tab --
+
+@item @code{strtof}
+@tab --
+@tab @code{strtof}
+@tab --
+@tab --
+@tab --
+
+@item @code{strtod}
+@tab @code{c_strtod}
+@tab @code{strtod}
+@tab --
+@tab --
+@tab --
+
+@item @code{strtold}
+@tab @code{c_strtold}
+@tab @code{strtold}
+@tab --
+@tab --
+@tab --
+
+@item @code{strfromf}
+@tab --
+@tab @code{strfromf}
+@tab --
+@tab --
+@tab --
+
+@item @code{strfromd}
+@tab --
+@tab @code{strfromd}
+@tab --
+@tab --
+@tab --
+
+@item @code{strfroml}
+@tab --
+@tab @code{strfroml}
+@tab --
+@tab --
+@tab --
+
+@item --
+@tab --
+@tab --
+@tab --
+@tab @code{mbstowcs}
+@tab @code{mbstoc32s}
+
+@item --
+@tab --
+@tab --
+@tab --
+@tab @code{mbsrtowcs}
+@tab @code{mbsrtoc32s}
+
+@item --
+@tab --
+@tab --
+@tab --
+@tab @code{mbsnrtowcs}
+@tab @code{mbsnrtoc32s}
+
+@item --
+@tab --
+@tab --
+@tab --
+@tab @code{wcstombs}
+@tab @code{c32stombs}
+
+@item --
+@tab --
+@tab --
+@tab --
+@tab @code{wcsrtombs}
+@tab @code{c32srtombs}
+
+@item --
+@tab --
+@tab --
+@tab --
+@tab @code{wcsnrtombs}
+@tab @code{c32snrtombs}
+
+@end multitable
+
+@node Characters
+@section Characters
+
+A @emph{character} is the elementary unit that strings are made of.
+
+What is a character?  ``A character is an element of a character set''
+is sort of a circular definition, but it highlights the fact that it is
+not merely a number.  Although many characters are visually represented
+by a single glyph, there are characters that, for example, have a
+different glyph when used at the end of a word than when used inside a
+word.  A character is also not the minimal rendered text processing
+unit; that is a grapheme cluster and in general consists of one or more
+characters.  If you want to know more about the concept of character and
+various concepts associated with characters, refer to the Unicode
+standard.
+
+For the representation in memory of a character, various types have been
+in use, and some of them were failures: @code{char} and @code{wchar_t}
+were invented for this purpose, but are not the right types.
+@code{char32_t} is the right type (successor of @code{wchar_t}); and
+@code{mbchar_t} (defined by Gnulib) is an alternative for specific kinds
+of processing.
+
+@menu
+* The char type::
+* The wchar_t type::
+* The char32_t type::
+* The mbchar_t type::
+* Comparison of character APIs::
+@end menu
+
+@node The char type
+@subsection The @code{char} type
+
+The @code{char} type has been part of the C language since its
+beginning in the 1970s, but --- due to its limitation of 256 possible
+values --- is no longer an adequate type for storing a character.
+
+Technically, it is still adequate in unibyte locales.  But since most
+locales nowadays are multibyte locales, it makes no sense to write a
+program that runs only in unibyte locales.
+
+ISO C and POSIX standardized an API for characters of type @code{char},
+in @code{<ctype.h>}.  This API is nowadays useless and obsolete.
+
+The important lessons to remember are:
+
+@cartouche
+@emph{A @samp{char} is just the elementary storage unit for a string,
+not a character.}
+@end cartouche
+
+@cartouche
+@emph{Never use @code{<ctype.h>}!}
+@end cartouche
+
+@node The wchar_t type
+@subsection The @code{wchar_t} type
+
+The ISO C and POSIX standard creators made an attempt to overcome the
+dead end regarding the @code{char} type.  They introduced
+@itemize @bullet
+@item
+a type @samp{wchar_t}, designed to encapsulate a character,
+@item
+a ``wide string'' type @samp{wchar_t *}, with some API functions
+declared in @posixheader{wchar.h}, and
+@item
+functions declared in @posixheader{wctype.h} that were meant to supplant
+the ones in @posixheader{ctype.h}.
+@end itemize
+
+Unfortunately, this API and its implementation have numerous problems:
+
+@itemize @bullet
+@item
+On Windows platforms and on AIX in 32-bit mode, @code{wchar_t} is a
+16-bit type.  This means that it can never accommodate an entire Unicode
+character.  Either the @code{wchar_t *} strings are limited to
+characters in UCS-2 (the ``Basic Multilingual Plane'' of Unicode), or
+--- if @code{wchar_t *} strings are encoded in UTF-16 --- a
+@code{wchar_t} represents only half of a character in the worst case,
+making the @posixheader{wctype.h} functions pointless.
+
+@item
+On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
+and undocumented.  This means that if you want to know any property of a
+@code{wchar_t} character, other than the properties defined by
+@posixheader{wctype.h} --- such as whether it's a dash, currency symbol,
+paragraph separator, or similar ---, you have to convert it to
+@code{char *} encoding first, by use of the function @posixfunc{wctomb}.
+
+@item
+When you read a stream of wide characters, through the functions
+@posixfunc{fgetwc} and @posixfunc{fgetws}, and when the input
+stream/file is not in the expected encoding, you have no way to
+determine the invalid byte sequence and do some corrective action.  If
+you use these functions, your program becomes ``garbage in - more
+garbage out'' or ``garbage in - abort''.
+@end itemize
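+
+The first problem, a 16-bit @code{wchar_t} holding only half of a
+character, follows directly from the UTF-16 encoding rules.  The sketch
+below is independent of any platform's @code{wchar_t}; it uses
+@code{uint16_t} to stand in for a 16-bit wide character, and the names
+@code{utf16_encode} and @code{utf16_is_surrogate} are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode the Unicode code point CP as UTF-16 into UNITS.  Return the
   number of 16-bit units written (1 or 2), or 0 if CP is a surrogate
   or lies beyond U+10FFFF.  */
int
utf16_encode (uint32_t cp, uint16_t units[2])
{
  assert (units != NULL);
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                  /* surrogates are not characters */
      units[0] = (uint16_t) cp;
      return 1;
    }
  if (cp > 0x10FFFF)
    return 0;
  cp -= 0x10000;                   /* 20 bits remain */
  units[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  units[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}

/* Return nonzero if UNIT is one half of a surrogate pair, that is,
   not a character by itself.  */
int
utf16_is_surrogate (uint16_t unit)
{
  return unit >= 0xD800 && unit <= 0xDFFF;
}
```

+A character outside the Basic Multilingual Plane, such as U+1F600,
+needs two units, and each unit taken alone is a surrogate.  A
+classification function that receives one 16-bit value at a time
+therefore has nothing meaningful to classify.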
+
+As a consequence, it is better to use multibyte strings.  Such multibyte
+strings can bypass limitations of the @code{wchar_t} type, if you use
+functions defined in Gnulib and GNU libunistring for text processing.
+They can also faithfully transport malformed characters that were
+present in the input, without requiring the program to produce garbage
+or abort.
+
+@node The char32_t type
+@subsection The @code{char32_t} type
+
+The ISO C and POSIX standard creators then introduced the
+@code{char32_t} type.  In ISO C 11, it was conceptually a ``32-bit wide
+character'' type.  In ISO C 23, its semantics has been further
+specified: A @code{char32_t} value is a Unicode code point.
+
+Thus, the @code{char32_t} type is not affected by the problems that
+plague the @code{wchar_t} type.
+
+The @code{char32_t} type and its API are defined in the @code{<uchar.h>}
+header file.
+
+ISO C and POSIX specify only the basic functions for the @code{char32_t}
+type, namely conversion of a single character (@func{mbrtoc32} and
+@func{c32rtomb}).  For convenience, Gnulib adds API for classification
+and case conversion of characters.
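+
+As an illustration of the basic conversion function, the following
+sketch decodes a multibyte string into an array of @code{char32_t}
+values with @code{mbrtoc32}.  The function name @code{mbs_to_c32} is
+hypothetical, and the sketch assumes a platform that provides
+@code{<uchar.h>} and a program that has already called
+@code{setlocale}.  Gnulib's @code{mbstoc32s} is the ready-made function
+for this task.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>    /* char32_t, mbrtoc32, mbstate_t */

/* Decode the multibyte string S, in the encoding of the current
   locale, into OUT, which has room for OUT_SIZE elements.  Return the
   number of characters stored, or (size_t) -1 if S is malformed or
   does not fit.  No terminating zero is stored.  */
size_t
mbs_to_c32 (const char *s, char32_t *out, size_t out_size)
{
  assert (s != NULL && out != NULL);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t remaining = strlen (s);
  size_t count = 0;
  while (remaining > 0)
    {
      char32_t c;
      size_t n = mbrtoc32 (&c, s, remaining, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        return (size_t) -1;        /* invalid or incomplete sequence */
      if (count == out_size)
        return (size_t) -1;        /* output array too small */
      out[count++] = c;
      if (n == (size_t) -3)
        continue;                  /* character came from the state */
      if (n == 0)
        n = 1;                     /* defensive: step over a NUL */
      s += n;
      remaining -= n;
    }
  return count;
}
```

+The reverse direction works the same way with @code{c32rtomb}, using an
+output buffer of @code{MB_LEN_MAX} bytes per character.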
+
+GNU libunistring can also be used on @code{char32_t} values.  Since
+@code{char32_t} is the same as @code{uint32_t}, all @code{u32_*}
+functions of GNU libunistring are applicable to arrays of
+@code{char32_t} values.
+
+On glibc systems, use of the 32-bit wide strings (@code{char32_t[]}) is
+exactly as efficient as the use of the older wide strings
+(@code{wchar_t[]}).  This is possible because on glibc, @code{wchar_t}
+values have always been 32-bit Unicode code points.
+@code{mbrtoc32} is just an alias of @code{mbrtowc}.  The Gnulib
+@code{*c32*} functions are optimized so that on glibc systems they
+immediately redirect to the corresponding @code{*wc*} functions.
+
+@node The mbchar_t type
+@subsection The @code{mbchar_t} type
+
+Gnulib defines an alternate way to encode a multibyte character:
+@code{mbchar_t}.  Its main feature is the ability to process a string or
+stream with some malformed characters without reporting an error.
+
+The type @code{mbchar_t}, defined in @code{"mbchar.h"}, holds a
+character in both the multibyte and the 32-bit wide character
+representation.  In case of a malformed character only the multibyte
+representation is used.
+
+@menu
+* Reading multibyte strings::
+@end menu
+
+@node Reading multibyte strings
+@subsubsection Reading multibyte strings
+
+If you want to process (possibly multibyte) characters while reading
+them from a @code{FILE *} stream, without reading them into a string
+first, the @code{mbfile} module is made for this purpose.
+
+@node Comparison of character APIs
+@subsection Comparison of character APIs
+
+This table summarizes the API functions available for characters, in
+POSIX and in Gnulib.
+
+@multitable @columnfractions .2 .2 .2 .2 .2
+@headitem unibyte character
+@tab assume C locale
+@tab wide character
+@tab 32-bit wide character
+@tab mbchar_t character
+
+@item @code{== '\0'}
+@tab @code{== '\0'}
+@tab @code{== L'\0'}
+@tab @code{== 0}
+@tab @code{mb_isnul}
+
+@item @code{==}
+@tab @code{==}
+@tab @code{==}
+@tab @code{==}
+@tab @code{mb_equal}
+
+@item @code{isalnum}
+@tab @code{c_isalnum}
+@tab @code{iswalnum}
+@tab @code{c32isalnum}
+@tab @code{mb_isalnum}
+
+@item @code{isalpha}
+@tab @code{c_isalpha}
+@tab @code{iswalpha}
+@tab @code{c32isalpha}
+@tab @code{mb_isalpha}
+
+@item @code{isblank}
+@tab @code{c_isblank}
+@tab @code{iswblank}
+@tab @code{c32isblank}
+@tab @code{mb_isblank}
+
+@item @code{iscntrl}
+@tab @code{c_iscntrl}
+@tab @code{iswcntrl}
+@tab @code{c32iscntrl}
+@tab @code{mb_iscntrl}
+
+@item @code{isdigit}
+@tab @code{c_isdigit}
+@tab @code{iswdigit}
+@tab @code{c32isdigit}
+@tab @code{mb_isdigit}
+
+@item @code{isgraph}
+@tab @code{c_isgraph}
+@tab @code{iswgraph}
+@tab @code{c32isgraph}
+@tab @code{mb_isgraph}
+
+@item @code{islower}
+@tab @code{c_islower}
+@tab @code{iswlower}
+@tab @code{c32islower}
+@tab @code{mb_islower}
+
+@item @code{isprint}
+@tab @code{c_isprint}
+@tab @code{iswprint}
+@tab @code{c32isprint}
+@tab @code{mb_isprint}
+
+@item @code{ispunct}
+@tab @code{c_ispunct}
+@tab @code{iswpunct}
+@tab @code{c32ispunct}
+@tab @code{mb_ispunct}
+
+@item @code{isspace}
+@tab @code{c_isspace}
+@tab @code{iswspace}
+@tab @code{c32isspace}
+@tab @code{mb_isspace}
+
+@item @code{isupper}
+@tab @code{c_isupper}
+@tab @code{iswupper}
+@tab @code{c32isupper}
+@tab @code{mb_isupper}
+
+@item @code{isxdigit}
+@tab @code{c_isxdigit}
+@tab @code{iswxdigit}
+@tab @code{c32isxdigit}
+@tab @code{mb_isxdigit}
+
+@item --
+@tab --
+@tab @code{iswctype}
+@tab --
+@tab --
+
+@item @code{tolower}
+@tab @code{c_tolower}
+@tab @code{towlower}
+@tab @code{c32tolower}
+@tab --
+
+@item @code{toupper}
+@tab @code{c_toupper}
+@tab @code{towupper}
+@tab @code{c32toupper}
+@tab --
+
+@item --
+@tab --
+@tab @code{towctrans}
+@tab --
+@tab --
+
+@item --
+@tab --
+@tab @code{wcwidth}
+@tab @code{c32width}
+@tab @code{mb_width}
+
+@end multitable