author Karl Williamson <khw@cpan.org> 2021-07-17 20:41:23 -0600
committer Karl Williamson <khw@cpan.org> 2021-07-17 20:55:04 -0600
commit 45d6cb5ec05ebf2c464b1f7b3d7eb4e4af99dbd0
tree aaed90d2b4764de393b3c305536827a1cd255d0c /handy.h
parent 0745d3547bd8d6f88a6e8fb725bbcdfed650efac
perlapi: Consolidate and revise pod for toCASE_CHANGEfoo()
This finishes the work started by 3cb048e540f681a46e641b466030b55feaf5d337 for toLOWERfoo, using the same paradigm, for the rest of the case changing macros and functions.
Diffstat (limited to 'handy.h')
-rw-r--r-- handy.h 262
1 files changed, 151 insertions, 111 deletions
diff --git a/handy.h b/handy.h
index 63441e2a22..acd1929f97 100644
--- a/handy.h
+++ b/handy.h
@@ -1180,79 +1180,108 @@ an API that does allow every possible legal result to be returned.) Likewise
no other function that is crippled by not being able to give the correct
results for the full range of possible inputs has been implemented here.
-=for apidoc Am|U8|toUPPER|int ch
-Converts the specified character to uppercase. If the input is anything but an
-ASCII lowercase character, that input character itself is returned. Variant
-C<toUPPER_A> is equivalent.
-
-=for apidoc Am|UV|toUPPER_uvchr|UV cp|U8* s|STRLEN* lenp
-Converts the code point C<cp> to its uppercase version, and
-stores that in UTF-8 in C<s>, and its length in bytes in C<lenp>. The code
-point is interpreted as native if less than 256; otherwise as Unicode. Note
-that the buffer pointed to by C<s> needs to be at least C<UTF8_MAXBYTES_CASE+1>
-bytes since the uppercase version may be longer than the original character.
-
-The first code point of the uppercased version is returned
-(but note, as explained at L<the top of this section|/Character case
-changing>, that there may be more.)
-
-=for apidoc Am|UV|toUPPER_utf8|U8* p|U8* e|U8* s|STRLEN* lenp
-=for apidoc_item toUPPER_utf8_safe
-Converts the first UTF-8 encoded character in the sequence starting at C<p> and
-extending no further than S<C<e - 1>> to its uppercase version, and
-stores that in UTF-8 in C<s>, and its length in bytes in C<lenp>. Note
-that the buffer pointed to by C<s> needs to be at least C<UTF8_MAXBYTES_CASE+1>
-bytes since the uppercase version may be longer than the original character.
-
-The first code point of the uppercased version is returned
-(but note, as explained at L<the top of this section|/Character case
-changing>, that there may be more).
-
-It will not attempt to read beyond S<C<e - 1>>, provided that the constraint
-S<C<s E<lt> e>> is true (this is asserted for in C<-DDEBUGGING> builds). If
-the UTF-8 for the input character is malformed in some way, the program may
-croak, or the function may return the REPLACEMENT CHARACTER, at the discretion
-of the implementation, and subject to change in future releases.
-
-C<toUPPER_utf8_safe> is now just a different spelling of plain C<toUPPER_utf8>
-
-=for apidoc Am|U8|toFOLD|U8 ch
-Converts the specified character to foldcase. If the input is anything but an
-ASCII uppercase character, that input character itself is returned. Variant
-C<toFOLD_A> is equivalent. (There is no equivalent C<to_FOLD_L1> for the full
-Latin1 range, as the full generality of L</toFOLD_uvchr> is needed there.)
-
-=for apidoc Am|UV|toFOLD_uvchr|UV cp|U8* s|STRLEN* lenp
-Converts the code point C<cp> to its foldcase version, and
-stores that in UTF-8 in C<s>, and its length in bytes in C<lenp>. The code
-point is interpreted as native if less than 256; otherwise as Unicode. Note
-that the buffer pointed to by C<s> needs to be at least C<UTF8_MAXBYTES_CASE+1>
-bytes since the foldcase version may be longer than the original character.
-
-The first code point of the foldcased version is returned
-(but note, as explained at L<the top of this section|/Character case
-changing>, that there may be more).
-
-=for apidoc Am|UV|toFOLD_utf8|U8* p|U8* e|U8* s|STRLEN* lenp
-=for apidoc_item toFOLD_utf8_safe
-Converts the first UTF-8 encoded character in the sequence starting at C<p> and
-extending no further than S<C<e - 1>> to its foldcase version, and
-stores that in UTF-8 in C<s>, and its length in bytes in C<lenp>. Note
-that the buffer pointed to by C<s> needs to be at least C<UTF8_MAXBYTES_CASE+1>
-bytes since the foldcase version may be longer than the original character.
-
-The first code point of the foldcased version is returned
-(but note, as explained at L<the top of this section|/Character case
-changing>, that there may be more).
-
-It will not attempt
-to read beyond S<C<e - 1>>, provided that the constraint S<C<s E<lt> e>> is
-true (this is asserted for in C<-DDEBUGGING> builds). If the UTF-8 for the
-input character is malformed in some way, the program may croak, or the
-function may return the REPLACEMENT CHARACTER, at the discretion of the
-implementation, and subject to change in future releases.
-
-C<toFOLD_utf8_safe> is now just a different spelling of plain C<toFOLD_utf8>
+=for apidoc Am|UV|toUPPER|UV cp
+=for apidoc_item |UV|toUPPER_A|UV cp
+=for apidoc_item |UV|toUPPER_uvchr|UV cp|U8* s|STRLEN* lenp
+=for apidoc_item |UV|toUPPER_utf8|U8* p|U8* e|U8* s|STRLEN* lenp
+=for apidoc_item |UV|toUPPER_utf8_safe|U8* p|U8* e|U8* s|STRLEN* lenp
+
+These all return the uppercase of a character. The differences are what domain
+they operate on, and whether the input is specified as a code point (those
+forms with a C<cp> parameter) or as a UTF-8 string (the others). In the latter
+case, the code point to use is the first one in the buffer of UTF-8 encoded
+code points, delineated by the arguments S<C<p .. e - 1>>.
+
+C<toUPPER> and C<toUPPER_A> are synonyms of each other. They return the
+uppercase of any lowercase ASCII-range code point. All other inputs are
+returned unchanged. Since these are macros, the input type may be any integral
+one, and the output will occupy the same number of bits as the input.
+
+There is no C<toUPPER_L1> nor C<toUPPER_LATIN1> as the uppercase of some code
+points in the 0..255 range is above that range or consists of multiple
+characters. Instead use C<toUPPER_uvchr>.
+
+C<toUPPER_uvchr> returns the uppercase of any Unicode code point. The return
+value is identical to that of C<toUPPER_A> for input code points in the ASCII
+range. The uppercase of the vast majority of Unicode code points is the same
+as the code point itself. For these, and for code points above the legal
+Unicode maximum, this returns the input code point unchanged. It additionally
+stores the UTF-8 of the result into the buffer beginning at C<s>, and its
+length in bytes into C<*lenp>. The caller must have made C<s> large enough to
+contain at least C<UTF8_MAXBYTES_CASE+1> bytes to avoid possible overflow.
+
+NOTE: the uppercase of a code point may be more than one code point. The
+return value of this function is only the first of these. The entire uppercase
+is returned in C<s>. To determine if the result is more than a single code
+point, you can do something like this:
+
+ uc = toUPPER_uvchr(cp, s, &len);
+ if (len > UTF8SKIP(s)) { is multiple code points }
+ else { is a single code point }
+
+C<toUPPER_utf8> and C<toUPPER_utf8_safe> are synonyms of each other. The only
+difference between these and C<toUPPER_uvchr> is that the source for these is
+encoded in UTF-8, instead of being a code point. It is passed as a buffer
+starting at C<p>, with C<e> pointing to one byte beyond its end. The C<p>
+buffer may certainly contain more than one code point; but only the first one
+(up through S<C<e - 1>>) is examined. If the UTF-8 for the input character is
+malformed in some way, the program may croak, or the function may return the
+REPLACEMENT CHARACTER, at the discretion of the implementation, and subject to
+change in future releases.
+
+=for apidoc Am|UV|toFOLD|UV cp
+=for apidoc_item |UV|toFOLD_A|UV cp
+=for apidoc_item |UV|toFOLD_uvchr|UV cp|U8* s|STRLEN* lenp
+=for apidoc_item |UV|toFOLD_utf8|U8* p|U8* e|U8* s|STRLEN* lenp
+=for apidoc_item |UV|toFOLD_utf8_safe|U8* p|U8* e|U8* s|STRLEN* lenp
+
+These all return the foldcase of a character. "foldcase" is an internal case
+for C</i> pattern matching. If the foldcase of character A and the foldcase of
+character B are the same, they match caselessly; otherwise they don't.
+
+The differences in the forms are what domain they operate on, and whether the
+input is specified as a code point (those forms with a C<cp> parameter) or as a
+UTF-8 string (the others). In the latter case, the code point to use is the
+first one in the buffer of UTF-8 encoded code points, delineated by the
+arguments S<C<p .. e - 1>>.
+
+C<toFOLD> and C<toFOLD_A> are synonyms of each other. They return the
+foldcase of any ASCII-range code point. In this range, the foldcase is
+identical to the lowercase. All other inputs are returned unchanged. Since
+these are macros, the input type may be any integral one, and the output will
+occupy the same number of bits as the input.
+
+There is no C<toFOLD_L1> nor C<toFOLD_LATIN1> as the foldcase of some code
+points in the 0..255 range is above that range or consists of multiple
+characters. Instead use C<toFOLD_uvchr>.
+
+C<toFOLD_uvchr> returns the foldcase of any Unicode code point. The return
+value is identical to that of C<toFOLD_A> for input code points in the ASCII
+range. The foldcase of the vast majority of Unicode code points is the same
+as the code point itself. For these, and for code points above the legal
+Unicode maximum, this returns the input code point unchanged. It additionally
+stores the UTF-8 of the result into the buffer beginning at C<s>, and its
+length in bytes into C<*lenp>. The caller must have made C<s> large enough to
+contain at least C<UTF8_MAXBYTES_CASE+1> bytes to avoid possible overflow.
+
+NOTE: the foldcase of a code point may be more than one code point. The
+return value of this function is only the first of these. The entire foldcase
+is returned in C<s>. To determine if the result is more than a single code
+point, you can do something like this:
+
+ uc = toFOLD_uvchr(cp, s, &len);
+ if (len > UTF8SKIP(s)) { is multiple code points }
+ else { is a single code point }
+
+C<toFOLD_utf8> and C<toFOLD_utf8_safe> are synonyms of each other. The only
+difference between these and C<toFOLD_uvchr> is that the source for these is
+encoded in UTF-8, instead of being a code point. It is passed as a buffer
+starting at C<p>, with C<e> pointing to one byte beyond its end. The C<p>
+buffer may certainly contain more than one code point; but only the first one
+(up through S<C<e - 1>>) is examined. If the UTF-8 for the input character is
+malformed in some way, the program may croak, or the function may return the
+REPLACEMENT CHARACTER, at the discretion of the implementation, and subject to
+change in future releases.
=for apidoc Am|UV|toLOWER|UV cp
=for apidoc_item |UV|toLOWER_A|UV cp
@@ -1311,44 +1340,55 @@ malformed in some way, the program may croak, or the function may return the
REPLACEMENT CHARACTER, at the discretion of the implementation, and subject to
change in future releases.
-=for apidoc Am|U8|toTITLE|U8 ch
-Converts the specified character to titlecase. If the input is anything but an
-ASCII lowercase character, that input character itself is returned. Variant
-C<toTITLE_A> is equivalent. (There is no C<toTITLE_L1> for the full Latin1
-range, as the full generality of L</toTITLE_uvchr> is needed there. Titlecase is
-not a concept used in locale handling, so there is no functionality for that.)
-
-=for apidoc Am|UV|toTITLE_uvchr|UV cp|U8* s|STRLEN* lenp
-Converts the code point C<cp> to its titlecase version, and
-stores that in UTF-8 in C<s>, and its length in bytes in C<lenp>. The code
-point is interpreted as native if less than 256; otherwise as Unicode. Note
-that the buffer pointed to by C<s> needs to be at least C<UTF8_MAXBYTES_CASE+1>
-bytes since the titlecase version may be longer than the original character.
-
-The first code point of the titlecased version is returned
-(but note, as explained at L<the top of this section|/Character case
-changing>, that there may be more).
-
-=for apidoc Am|UV|toTITLE_utf8|U8* p|U8* e|U8* s|STRLEN* lenp
-=for apidoc_item toTITLE_utf8_safe
-Convert the first UTF-8 encoded character in the sequence starting at C<p> and
-extending no further than S<C<e - 1>> to its titlecase version, and
-stores that in UTF-8 in C<s>, and its length in bytes in C<lenp>. Note
-that the buffer pointed to by C<s> needs to be at least C<UTF8_MAXBYTES_CASE+1>
-bytes since the titlecase version may be longer than the original character.
-
-The first code point of the titlecased version is returned
-(but note, as explained at L<the top of this section|/Character case
-changing>, that there may be more).
-
-It will not attempt
-to read beyond S<C<e - 1>>, provided that the constraint S<C<s E<lt> e>> is
-true (this is asserted for in C<-DDEBUGGING> builds). If the UTF-8 for the
-input character is malformed in some way, the program may croak, or the
-function may return the REPLACEMENT CHARACTER, at the discretion of the
-implementation, and subject to change in future releases.
-
-C<toTITLE_utf8_safe> is now just a different spelling of plain C<toTITLE_utf8>
+=for apidoc Am|UV|toTITLE|UV cp
+=for apidoc_item |UV|toTITLE_A|UV cp
+=for apidoc_item |UV|toTITLE_uvchr|UV cp|U8* s|STRLEN* lenp
+=for apidoc_item |UV|toTITLE_utf8|U8* p|U8* e|U8* s|STRLEN* lenp
+=for apidoc_item |UV|toTITLE_utf8_safe|U8* p|U8* e|U8* s|STRLEN* lenp
+
+These all return the titlecase of a character. The differences are what domain
+they operate on, and whether the input is specified as a code point (those
+forms with a C<cp> parameter) or as a UTF-8 string (the others). In the latter
+case, the code point to use is the first one in the buffer of UTF-8 encoded
+code points, delineated by the arguments S<C<p .. e - 1>>.
+
+C<toTITLE> and C<toTITLE_A> are synonyms of each other. They return the
+titlecase of any lowercase ASCII-range code point. In this range, the
+titlecase is identical to the uppercase. All other inputs are returned
+unchanged. Since these are macros, the input type may be any integral one, and
+the output will occupy the same number of bits as the input.
+
+There is no C<toTITLE_L1> nor C<toTITLE_LATIN1> as the titlecase of some code
+points in the 0..255 range is above that range or consists of multiple
+characters. Instead use C<toTITLE_uvchr>.
+
+C<toTITLE_uvchr> returns the titlecase of any Unicode code point. The return
+value is identical to that of C<toTITLE_A> for input code points in the ASCII
+range. The titlecase of the vast majority of Unicode code points is the same
+as the code point itself. For these, and for code points above the legal
+Unicode maximum, this returns the input code point unchanged. It additionally
+stores the UTF-8 of the result into the buffer beginning at C<s>, and its
+length in bytes into C<*lenp>. The caller must have made C<s> large enough to
+contain at least C<UTF8_MAXBYTES_CASE+1> bytes to avoid possible overflow.
+
+NOTE: the titlecase of a code point may be more than one code point. The
+return value of this function is only the first of these. The entire titlecase
+is returned in C<s>. To determine if the result is more than a single code
+point, you can do something like this:
+
+ uc = toTITLE_uvchr(cp, s, &len);
+ if (len > UTF8SKIP(s)) { is multiple code points }
+ else { is a single code point }
+
+C<toTITLE_utf8> and C<toTITLE_utf8_safe> are synonyms of each other. The only
+difference between these and C<toTITLE_uvchr> is that the source for these is
+encoded in UTF-8, instead of being a code point. It is passed as a buffer
+starting at C<p>, with C<e> pointing to one byte beyond its end. The C<p>
+buffer may certainly contain more than one code point; but only the first one
+(up through S<C<e - 1>>) is examined. If the UTF-8 for the input character is
+malformed in some way, the program may croak, or the function may return the
+REPLACEMENT CHARACTER, at the discretion of the implementation, and subject to
+change in future releases.
=cut
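
The C<len E<gt> UTF8SKIP(s)> check described in the new pod can be tried outside of perl with a small stand-alone sketch. Nothing below is Perl's API: C<sketch_upper>, C<sketch_utf8_skip> and C<sketch_is_multi> are hypothetical stand-ins for C<toUPPER_uvchr>, C<UTF8SKIP> and the caller's test. The sketch assumes an ASCII platform and an input no larger than 0x10FFFF, and it knows only the ASCII letters plus U+00DF, whose full uppercase is the two code points "SS".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UTF8SKIP(): the byte length of the UTF-8
 * character starting at s, judged from its first byte alone. */
static size_t sketch_utf8_skip(const uint8_t *s)
{
    if (*s < 0x80)           return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Hypothetical stand-in for toUPPER_uvchr().  It handles only ASCII
 * letters and U+00DF; every other input is treated as caseless, which
 * is NOT true of the real function.  The UTF-8 of the whole result is
 * stored in s, its byte length in *lenp, and the first code point of
 * the result is returned. */
static uint32_t sketch_upper(uint32_t cp, uint8_t *s, size_t *lenp)
{
    if (cp == 0xDF) {               /* uppercase is "SS": two code points */
        s[0] = 'S';
        s[1] = 'S';
        *lenp = 2;
        return 'S';                 /* only the first one is returned */
    }

    if (cp >= 'a' && cp <= 'z')
        cp -= 'a' - 'A';

    /* Encode the single resulting code point as UTF-8 */
    if (cp < 0x80) {
        s[0] = (uint8_t) cp;
        *lenp = 1;
    }
    else if (cp < 0x800) {
        s[0] = (uint8_t) (0xC0 | (cp >> 6));
        s[1] = (uint8_t) (0x80 | (cp & 0x3F));
        *lenp = 2;
    }
    else if (cp < 0x10000) {
        s[0] = (uint8_t) (0xE0 | (cp >> 12));
        s[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        s[2] = (uint8_t) (0x80 | (cp & 0x3F));
        *lenp = 3;
    }
    else {
        s[0] = (uint8_t) (0xF0 | (cp >> 18));
        s[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
        s[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        s[3] = (uint8_t) (0x80 | (cp & 0x3F));
        *lenp = 4;
    }
    return cp;
}

/* The test from the pod: the result is more than one code point exactly
 * when its total byte length exceeds the length of its first character. */
static int sketch_is_multi(uint32_t cp)
{
    uint8_t s[8];                   /* ample for this sketch's outputs */
    size_t len;

    (void) sketch_upper(cp, s, &len);
    return len > sketch_utf8_skip(s);
}
```

Comparing against the length of the first character, rather than against 1, is what keeps a single multi-byte character from being mistaken for several code points. Real callers would instead size the buffer at C<UTF8_MAXBYTES_CASE+1> bytes, as the pod requires.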