path: root/utf8.c
Commit message | Author | Age | Files | Lines
* White-space, comments only | Karl Williamson | 2014-01-27 | 1 | -0/+1
| This mostly indents and outdents based on blocks added or removed by
| the previous commit. But there are a few comment changes and vertical
| alignment of macro backslash continuation characters, and other
| white-space changes.
* Work properly under UTF-8 LC_CTYPE locales | Karl Williamson | 2014-01-27 | 1 | -24/+51
| This large (sorry, I couldn't figure out how to meaningfully split it
| up) commit causes Perl to fully support LC_CTYPE operations (case
| changing, character classification) in UTF-8 locales.
|
| As a side effect it resolves [perl #56820].
|
| The basics are easy, but there were a lot of details, and one
| troublesome edge case discussed below.
|
| What essentially happens is that when the locale is changed to a UTF-8
| one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
| locale). Within the scope of 'use locale', this variable is checked,
| and if TRUE, the code that Perl uses for non-locale behavior is used
| instead of the code for locale behavior. Since Perl's internal
| representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
|
| More work had to be done for regular expressions. There are three
| cases.
|
| 1) The character classes \w, [[:punct:]] needed no extra work, as the
|    changes fall out from the base work.
|
| 2) Strings that are to be matched case-insensitively. These form
|    EXACTFL regops (nodes). Notice that if such a string contains only
|    above-Latin1 characters that match only themselves, the node can be
|    downgraded to an EXACT-only node, which presents better optimization
|    possibilities, as we now have a fixed string known at compile time
|    to be required to be in the target string to match. Similarly, if
|    all characters in the string match only other above-Latin1
|    characters case-insensitively, the node can be downgraded to a
|    regular EXACTFU node (match, folding, using Unicode, not locale,
|    rules). The code changes for this could be done without accepting
|    UTF-8 locales fully, but there were edge cases which needed to be
|    handled differently if I stopped there, so I continued on.
|
|    In an EXACTFL node, all such characters are now folded at compile
|    time (just as before this commit), while the other characters whose
|    folds are locale-dependent are left unfolded. This means that they
|    have to be folded at execution time based on the locale in effect
|    at the moment. Again, this isn't a change from before. The
|    difference is that now some of the folds that need to be done at
|    execution time (in regexec) are potentially multi-char. Some of the
|    code in regexec was trivial to extend to account for this because
|    of existing infrastructure, but the part dealing with regex
|    quantifiers needed more work.
|
|    Also the code that joins EXACTish nodes together had to be expanded
|    to account for the possibility of multi-character folds within
|    locale handling. This was fairly easy, because it already has
|    infrastructure to handle these under somewhat different
|    circumstances.
|
| 3) In bracketed character classes, represented by ANYOF nodes, a new
|    inversion list was created giving the characters that should be
|    matched by this node when the runtime locale is UTF-8. The list is
|    ignored except under that circumstance. To do this, I created a new
|    ANYOF type which has an extra SV for the inversion list.
|
| The edge case that caused the most difficulty is folding involving the
| MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
| GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
| character that folds to outside that range. The issue is that it
| doesn't naturally fall out that it will match the CAP MU. If we let
| the CAP MU fold to the small mu at compile time (which it can because
| both are above-Latin1 and so the fold is the same no matter what
| locale is in effect), it could appear that the regnode can be
| downgraded away from EXACTFL to EXACTFU, but doing so would cause the
| MICRO SIGN to not case-insensitively match the CAP MU. This could be
| special cased in regcomp and regexec, but I wanted to avoid that.
| Instead the mktables tables are set up to include the CAP MU as a
| character whose presence forbids the downgrading, so the special
| casing is in mktables, and not in the C code.
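The MICRO SIGN edge case described above can be illustrated with a small sketch. The fold values are the ones stated in the commit message (both U+00B5 and U+039C fold to U+03BC); the function names and the three-entry table are hypothetical and are not Perl's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simple-fold table covering only the code points that
 * matter for this edge case; every other code point folds to itself. */
static uint32_t simple_fold(uint32_t cp)
{
    switch (cp) {
    case 0x00B5: return 0x03BC;  /* MICRO SIGN -> GREEK SMALL LETTER MU */
    case 0x039C: return 0x03BC;  /* GREEK CAPITAL LETTER MU -> small mu */
    default:     return cp;
    }
}

/* Two code points match case-insensitively when their folds are equal. */
static bool fold_eq(uint32_t a, uint32_t b)
{
    return simple_fold(a) == simple_fold(b);
}

/* True when a 0-255 character folds to something outside that range. */
static bool fold_leaves_latin1(uint32_t cp)
{
    return cp <= 0xFF && simple_fold(cp) > 0xFF;
}

/* Small self-check exercising the helpers above. */
void check_micro_sign(void)
{
    assert(fold_eq(0x00B5, 0x039C));     /* MICRO SIGN matches CAP MU */
    assert(fold_leaves_latin1(0x00B5));  /* and its fold crosses 0xFF */
    assert(!fold_leaves_latin1('A'));    /* identity in this toy table */
}
```

Because the capital mu is above Latin1 yet must still match a character in the 0-255 range, a node containing it cannot be treated as locale-independent, which is why the commit flags it in the mktables output.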
* Rename an internal flag | Karl Williamson | 2014-01-27 | 1 | -4/+4
| The UTF8 in the name is kind of misleading, and would be more
| misleading after future commits make UTF8 locales special.
* Taint more operands with case changes | Karl Williamson | 2014-01-27 | 1 | -28/+9
| The documentation says that Perl taints certain operations when
| subject to locale rules, such as lc() and ucfirst(). Prior to this
| commit there were exceptions when the operand to these functions
| contained no characters whose case change actually varied depending on
| the locale, for example the empty string or above-Latin1 code points.
| Changing to conform to the documentation simplifies the core code, and
| yields more consistent results.
* Comments, white-space | Karl Williamson | 2014-01-22 | 1 | -4/+7
| This adds and modifies various comments in several files, rewrapping
| some comments to occupy fewer lines but not exceed 79 columns. It also
| fixes some indentation and other white-space issues, including
| removing trailing white space in lines in regcomp.c.
|
| I didn't think it was worth making a commit for each file.
* Turn on read-only flag for some unchangeable inversion lists | Karl Williamson | 2014-01-16 | 1 | -0/+3
| These lists are read-only. Turning on the flag may allow some
| optimisations to be done, including some that may be added in the
| future.
* IDStart and IDCont no longer go out to disk | Karl Williamson | 2014-01-09 | 1 | -2/+11
| These are the base names for various macros used in parsing
| identifiers. Prior to this patch, parsing a code point above Latin1
| caused disk files to be loaded. This patch causes all the information
| to be compiled into the Perl binary.
* isWORDCHAR_uni(), isDIGIT_utf8() etc no longer go out to disk | Karl Williamson | 2014-01-09 | 1 | -11/+22
| Previous commits in this series have caused all the POSIX classes to
| be completely specified at C compile time. This allows us to revise
| the base function used by all these macros to use these definitions,
| avoiding reading them in from disk.
* utf8.c: Add comment | Karl Williamson | 2014-01-09 | 1 | -1/+3
|
* utf8.c: Move a bunch of deprecated fcns to mathoms.c | Karl Williamson | 2014-01-05 | 1 | -400/+0
| These functions will be out of the way in mathoms. There were a few
| that could not be moved as-is, so I left them.
* utf8.c: Use existing macros instead of duplicate code | Karl Williamson | 2014-01-05 | 1 | -101/+38
| In all these cases, there is an already existing macro that does
| exactly the same thing as the code that this commit replaces. No sense
| duplicating logic.
* Change some warnings in utf8n_to_uvchr() | Karl Williamson | 2014-01-01 | 1 | -26/+26
| This bottom-level function decodes the first character of a UTF-8
| string into a code point. Using it directly is discouraged. This
| commit cleans up some of the warnings it can raise. Now, tests for
| malformations are done before any tests for other potential issues.
| One of those issues involves code points so large that they have
| never appeared in any official standard (the current standard has
| scaled back the highest acceptable code point from earlier versions).
| It is possible (though not done in CPAN) to warn and/or forbid these
| code points, while accepting smaller code points that are still above
| the legal Unicode maximum. The warning message for this now includes
| the code point if representable on the machine. Previously it always
| displayed raw bytes, which is what it still does for
| non-representable code points.
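The ordering described above (malformation checks first, other checks afterwards) can be sketched as follows. This is an illustration only: it handles standard 1 to 4 byte UTF-8 on an ASCII platform, and the names `decode_first` and `dec_status` are hypothetical. It is not utf8n_to_uvchr() and does not share its signature or flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    DEC_OK, DEC_MALFORMED, DEC_SURROGATE, DEC_ABOVE_UNICODE
} dec_status;

/* Decode the first character of s (len bytes available).  On any status
 * other than DEC_MALFORMED, *cp and *used are filled in. */
static dec_status decode_first(const unsigned char *s, size_t len,
                               uint32_t *cp, size_t *used)
{
    uint32_t v, min;
    size_t need, i;

    if (len == 0) return DEC_MALFORMED;
    if (s[0] < 0x80)                { need = 1; v = s[0];        min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; v = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; v = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; v = s[0] & 0x07; min = 0x10000; }
    else return DEC_MALFORMED;      /* stray continuation or unsupported start */

    /* Step 1: malformations (truncation, bad continuation, overlong). */
    if (need > len) return DEC_MALFORMED;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return DEC_MALFORMED;
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min) return DEC_MALFORMED;

    *cp = v;
    *used = need;

    /* Step 2: only now look at well-formed but problematic code points. */
    if (v >= 0xD800 && v <= 0xDFFF) return DEC_SURROGATE;
    if (v > 0x10FFFF) return DEC_ABOVE_UNICODE;
    return DEC_OK;
}

/* Small self-check: U+00E9 encoded as C3 A9. */
void check_decode(void)
{
    const unsigned char e_acute[] = { 0xC3, 0xA9 };
    uint32_t cp = 0;
    size_t used = 0;
    assert(decode_first(e_acute, 2, &cp, &used) == DEC_OK);
    assert(cp == 0xE9 && used == 2);
}
```

Doing the structural checks first means a caller never receives a "surrogate" or "above Unicode" classification for input that is not even well-formed.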
* utf8.c: Fix warning category and subcategory conflicts | Karl Williamson | 2014-01-01 | 1 | -6/+6
| The warnings categories non_unicode, nonchar, and surrogate are all
| subcategories of 'utf8'. One should never call a packWARN() with both
| a category and a subcategory of it, as it will mean that one can't
| completely make the subcategory independent. For example,
|
|     use warnings 'utf8';
|     no warnings 'surrogate';
|
| surrogate warnings will be output if they are tested with a
|
|     ckWARN2(WARN_UTF8, WARN_SURROGATE);
|
| utf8.c was guilty of this.
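The problem described above can be modelled with a few bit flags. This is a toy model under stated assumptions: the flag names and helper functions are hypothetical stand-ins, not Perl's warning machinery; the only behaviour borrowed from the commit message is that a two-category check succeeds when either category is enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit flags standing in for warning categories. */
enum {
    W_UTF8        = 1 << 0,  /* parent category        */
    W_SURROGATE   = 1 << 1,  /* subcategories of utf8  */
    W_NONCHAR     = 1 << 2,
    W_NON_UNICODE = 1 << 3
};

/* Models "use warnings 'utf8'": parent plus all subcategories go on. */
static unsigned use_utf8(unsigned enabled)
{
    return enabled | W_UTF8 | W_SURROGATE | W_NONCHAR | W_NON_UNICODE;
}

/* Models "no warnings 'surrogate'": only the subcategory goes off. */
static unsigned no_surrogate(unsigned enabled)
{
    return enabled & ~(unsigned)W_SURROGATE;
}

/* Single-category check. */
static bool ck1(unsigned enabled, unsigned a)
{
    return (enabled & a) != 0;
}

/* Two-category check: true when either category is enabled. */
static bool ck2(unsigned enabled, unsigned a, unsigned b)
{
    return (enabled & (a | b)) != 0;
}

/* Small self-check reproducing the scenario from the commit message. */
void check_warn_model(void)
{
    unsigned w = no_surrogate(use_utf8(0));
    assert(ck2(w, W_UTF8, W_SURROGATE));  /* still fires: the bug     */
    assert(!ck1(w, W_SURROGATE));         /* subcategory-only: silent */
}
```

Testing only the subcategory respects the user's `no warnings 'surrogate'`, whereas pairing it with its parent lets the still-enabled parent override that choice.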
* utf8.c: Don't do redundant test | Karl Williamson | 2014-01-01 | 1 | -1/+1
| The test here for WARN_UTF8 is redundant, as only if one of the other
| three warning categories is enabled will anything actually be output.
* utf8.c: Typo in comment, and clarification | Karl Williamson | 2014-01-01 | 1 | -1/+1
|
* Remove no-longer used inversion list function | Karl Williamson | 2013-12-31 | 1 | -1/+1
| The function _invlist_invert_prop() is hereby removed. The recent
| changes to allow \p{} to match above-Unicode means that no special
| handling of properties need be done when inverting.
|
| This function was accessible to XS code that cheated by using #defines
| to pretend it was something it wasn't, but it also has been marked as
| subject to change since its inception, and never appeared in any
| documentation.
* White-space only | Karl Williamson | 2013-12-31 | 1 | -28/+28
| This indents various newly-formed blocks (by the previous commit) in
| these three files, and reflows lines to fit into 79 columns.
* Change format of mktables output binary property tables | Karl Williamson | 2013-12-31 | 1 | -0/+28
| mktables now outputs the tables for binary properties as inversion
| lists, with a size as the first element. This means simpler handling
| of these tables in the core, including removal of an entire pass over
| them (it was done just to get the size). These tables are marked as
| for internal use by the Perl core only, so their format is changeable
| at will.
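For readers unfamiliar with inversion lists, here is a minimal sketch of a lookup in one. An inversion list is a sorted array of code points at which set membership toggles, so entries at even positions start a range that is in the set. The layout below (element 0 holds the count of entries that follow) is an assumption modelled on the "size as the first element" wording above; the actual mktables format may differ, and `invlist_contains` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: list[0] is the number of entries that follow.
 * This example set is ASCII digits plus ASCII lowercase letters:
 * [0x30, 0x3A) and [0x61, 0x7B). */
static const uint32_t digits_and_lower[] = { 4, 0x30, 0x3A, 0x61, 0x7B };

/* Binary search for the number of entries <= cp.  An odd count means
 * cp lies inside a range that belongs to the set. */
static bool invlist_contains(const uint32_t *list, uint32_t cp)
{
    const uint32_t *v = list + 1;
    size_t lo = 0, hi = list[0];   /* the size is known up front */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1) == 1;
}

/* Small self-check. */
void check_invlist(void)
{
    assert(invlist_contains(digits_and_lower, '5'));
    assert(!invlist_contains(digits_and_lower, ':'));
}
```

Storing the size in the first element is what lets the search start immediately, without a separate counting pass over the table.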
* perlapi: Consistent spaces after dots | Father Chrysostomos | 2013-12-29 | 1 | -9/+12
| plus some typo fixes. I probably changed some things in perlintern,
| too.
* utf8.c: White-space only | Karl Williamson | 2013-12-06 | 1 | -3/+4
| Rearrange this multi-line conditional to be easier to read.
* perlapi: Grammar nits | Karl Williamson | 2013-12-06 | 1 | -6/+6
| "The" referring to a parameter here does not look right to me, a
| native English speaker.
* utf8.c: Remove hard-coded names. | Karl Williamson | 2013-12-06 | 1 | -8/+21
| The names of these hashes stored in some disk files are retrievable
| by a standardized lookup. There is no need to have them hard-coded in
| C code. This is one less opportunity for the file and the code to get
| out of sync.
* perlapi: Nits | Karl Williamson | 2013-12-04 | 1 | -2/+2
|
* utf8.c: Use U8 instead of UV in several places | Karl Williamson | 2013-12-03 | 1 | -4/+4
| These temporaries are all known to fit into 8 bits; by using a U8 it
| should be more obvious to an optimizing compiler, and so the bounds
| checking need not be done.
* fix -Wsign-compare in core | David Mitchell | 2013-11-29 | 1 | -4/+8
| There were a few places that were doing
|
|     unsigned_var = cond ? signed_val : unsigned_val;
|
| or similar. Fixed by suitable casts etc.
|
| The four in utf8.c were fixed by assigning to an intermediate unsigned
| var; this has the happy side-effect of collapsing a large macro
| expansion, where toUPPER_LC() etc evaluate their arg multiple times.
* utf8.c: White-space only | Karl Williamson | 2013-10-16 | 1 | -9/+9
| This outdents code to the proper level given that the surrounding
| block has been removed.
* Change mktables output for some tables to use hex | Karl Williamson | 2013-10-16 | 1 | -12/+1
| This makes all the tables in the lib/unicore/To directory that map
| from code point to code point be formatted so that the mapped-to code
| point is expressed as hexadecimal. This allows for uniform treatment
| of these tables in utf8.c, and removes the final use of strtol() in
| the (non-CPAN) core. strtol() should be avoided because it is subject
| to locale rules, and some older libc implementations have been buggy.
| It was used because Perl doesn't have an efficient way of parsing a
| decimal number and advancing the parse pointer to beyond it; we do
| have such a method for hex numbers. The input to mktables published
| by Unicode is also in hex, so this now conforms to that convention.
|
| This also will facilitate the new work currently being done to read
| in the tables that find the closing bracket given an opening one.
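The idea of parsing a hex number while advancing the parse pointer, without going through strtol() and its locale dependence, can be sketched like this. `parse_hex_advance` is a hypothetical helper written for this illustration, not the routine Perl uses, and the tab-separated sample line in the self-check is an assumed format.

```c
#include <assert.h>
#include <stdint.h>

/* Parse consecutive hex digits starting at *pp and leave *pp pointing
 * at the first character that is not a hex digit.  Explicit character
 * ranges are used, so no locale-sensitive library call is involved.
 * No overflow checking is done in this sketch. */
static uint32_t parse_hex_advance(const char **pp)
{
    const char *p = *pp;
    uint32_t value = 0;

    for (;;) {
        unsigned digit;
        char c = *p;

        if (c >= '0' && c <= '9')      digit = (unsigned)(c - '0');
        else if (c >= 'A' && c <= 'F') digit = (unsigned)(c - 'A') + 10;
        else if (c >= 'a' && c <= 'f') digit = (unsigned)(c - 'a') + 10;
        else break;

        value = (value << 4) | digit;
        p++;
    }
    *pp = p;
    return value;
}

/* Small self-check on an assumed "from<TAB>to" line. */
void check_parse_hex(void)
{
    const char *p = "00B5\t03BC\n";
    assert(parse_hex_advance(&p) == 0xB5);
    assert(*p == '\t');
}
```

A caller can step over the separator and call the helper again to read the next field on the same line.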
* utf8.c: Silence Win32 compiler warnings | Karl Williamson | 2013-09-30 | 1 | -8/+8
| The Win32 compiler doesn't realize that the values in these places
| can be a max of 255. Other compilers don't warn.
* Removed an ifdef for IS_UTF8_CHAR in utf8.c | Brian Fraser | 2013-09-21 | 1 | -2/+0
| IS_UTF8_CHAR is defined by utf8.h, so this is always defined. In
| fact, later in utf8.c we use it again, this time without the ifdef.
* The choice of 7 or 13 byte extended UTF-8 should be based on UVSIZE. | Nicholas Clark | 2013-09-17 | 1 | -2/+2
| Previously it was based on HAS_QUAD, which is not (as) correct.
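To see why the size of the integer type is the relevant quantity, consider how many bytes an extended UTF-8 sequence needs. The sketch below assumes the usual ASCII-platform thresholds, where a 7-byte sequence carries 6 continuation bytes of 6 bits each (36 bits) and anything larger uses 13 bytes. The function names are hypothetical, and the thresholds are a reading of the "7 or 13 byte" wording rather than a copy of Perl's macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode cp in extended UTF-8 (ASCII platforms). */
static int ext_utf8_len(uint64_t cp)
{
    if (cp < 0x80)            return 1;
    if (cp < 0x800)           return 2;
    if (cp < 0x10000)         return 3;
    if (cp < 0x200000)        return 4;
    if (cp < 0x4000000)       return 5;
    if (cp < 0x80000000ULL)   return 6;
    if (cp < 0x1000000000ULL) return 7;   /* up to 36 bits of payload */
    return 13;
}

/* Longest sequence ever needed for an integer type of the given size:
 * a 4-byte type cannot hold more than 32 bits, so 7 bytes suffice. */
static int max_ext_utf8_len(size_t int_size_bytes)
{
    return int_size_bytes > 4 ? 13 : 7;
}

/* Small self-check tying the two functions together. */
void check_ext_len(void)
{
    assert(ext_utf8_len(0xFFFFFFFFULL) == max_ext_utf8_len(4));
    assert(ext_utf8_len(UINT64_MAX) == max_ext_utf8_len(8));
}
```

Under these assumptions, a platform may have a 64-bit type available while still using a 32-bit code point type, which would explain why keying the choice on the code point type's size is the more accurate test.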
* perlapi: Typos; clarify comment | Karl Williamson | 2013-09-16 | 1 | -4/+6
|
* Use separate macros for byte vs uv Unicode | Karl Williamson | 2013-09-10 | 1 | -3/+3
| This removes a macro not yet even in a development release, and
| splits its calls into two classes: those where the input is a byte;
| and those where it can be any unsigned integer. The byte
| implementation avoids a function call on EBCDIC platforms.
* Move functions prematurely placed into mathoms back to utf8.c | Karl Williamson | 2013-09-04 | 1 | -0/+58
| These functions are still called by some CPAN-upstream modules, so
| can't go into mathoms until those are fixed. There are other changes
| needed in these modules, so I'm deferring sending patches to their
| maintainers until I know all the necessary changes.
* perlapi: Remove newly obsolete statement | Karl Williamson | 2013-09-04 | 1 | -2/+1
| Since commit 010ab96b9b802bbf77168b5af384569e053cdb63, this function
| is no longer a wrapper, so shouldn't be described as such.
* utf8.c: Add comment | Karl Williamson | 2013-08-29 | 1 | -3/+11
|
* utf8.c: Add omitted fold cases | Karl Williamson | 2013-08-29 | 1 | -5/+24
| The LATIN SMALL LETTER SHARP S can't fold to 'ss' under /iaa because
| the definition of /aa prohibits it, but it can fold to two consecutive
| instances of LATIN SMALL LETTER LONG S. A capital sharp s can do the
| same, and that was fixed in 1ca267a5, but this one was overlooked
| then.
|
| It turns out that another possibility was also overlooked in
| 1ca267a5. Both U+FB05 (LATIN SMALL LIGATURE LONG S T) and U+FB06
| (LATIN SMALL LIGATURE ST) fold to the string 'st', except under /iaa
| these folds are prohibited. But U+FB05 and U+FB06 are equivalent to
| each other under /iaa. This wasn't working until now. This commit
| changes things so both fold to FB06.
|
| This bug would only surface during /iaa matching, and I don't believe
| there are any current code paths which lead to it, hence no tests are
| added by this commit. However, a future commit will lead to this bug,
| and existing tests find it then.
* utf8.c: Move some code around for speed | Karl Williamson | 2013-08-29 | 1 | -5/+7
| This is a micro optimization. We now check for a common case and
| return if found, before checking for a relatively uncommon case.
* utf8.c: Remove wrapper functions. | Karl Williamson | 2013-08-29 | 1 | -103/+80
| Now that the Unicode data is stored in native character set order, it
| is rare to need to work with the Unicode order. Traditionally, the
| real work was done in functions that worked with the Unicode order,
| and wrapper functions (or macros) were used to translate to/from
| native. There are two groups of functions: one that translates from
| code point to UTF-8, and the other group goes the opposite direction.
|
| This commit changes the base function that translates from UTF-8 to
| code point to output native instead of Unicode. Those extremely rare
| instances where Unicode output is needed instead will have to
| hand-wrap calls to this function with a translation macro, as now
| described in the API pod. Prior to this, it was the other way: the
| native was wrapped, and the rare, strict Unicode wasn't. This
| eliminates a layer of function call overhead for a common case.
|
| The base function that translates from code point to UTF-8 retains
| its Unicode input, as that is more natural to process. However, it is
| de-emphasized in the pod, with the functionality description moved to
| the pod for a native input wrapper function. And, those wrappers are
| now macros in all cases; previously there was function call overhead
| sometimes. (Equivalent exported functions are retained, however, for
| XS code that uses the Perl_foo() form.)
|
| I had hoped to rebase this commit, squashing it with an earlier
| commit in this series, eliminating the use of a temporary function
| name change, but the work involved turns out to be large, with no
| real payoff.
* perlapi vis utf8.c: Nits | Karl Williamson | 2013-08-29 | 1 | -5/+4
|
* utf8.c: Move 2 functions to earlier in file | Karl Williamson | 2013-08-29 | 1 | -36/+36
| This moves these two functions to be adjacent to the function they
| each call, thus keeping like things together.
* utf8.c: Don't use slower general-purpose function | Karl Williamson | 2013-08-29 | 1 | -3/+7
| There is a macro that accomplishes the same task for a two byte UTF-8
| encoded character, and avoids the overhead of the general purpose
| function call.
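As a rough illustration of why a dedicated two-byte macro is cheaper than a general decoder, the two-byte case reduces to a couple of masks and a shift. The macro name below is hypothetical, the arithmetic is for ASCII platforms only (EBCDIC would need a translation step), and it assumes the caller already knows the input is a well-formed two-byte sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro: combine a two-byte UTF-8 sequence into a code
 * point.  The start byte contributes its low 5 bits, the continuation
 * byte its low 6 bits.  No validation is performed. */
#define TWO_BYTE_TO_CP(hi, lo) \
    ((uint32_t)((((hi) & 0x1Fu) << 6) | ((lo) & 0x3Fu)))

/* Small self-check: U+00E9 is encoded as C3 A9. */
void check_two_byte(void)
{
    assert(TWO_BYTE_TO_CP(0xC3, 0xA9) == 0xE9);
}
```

There is no loop, no length computation, and no function call, which is the overhead the commit avoids for this common case.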
* utf8.c: Don't do ++ in macro parameter | Karl Williamson | 2013-08-29 | 1 | -2/+3
| The formal parameter gets evaluated multiple times on an EBCDIC
| platform, thus incrementing more than the intended once.
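The hazard described above is easy to reproduce with any macro that mentions its parameter more than once. The macro here is hypothetical and exists only to show the effect; it is not one of the macros touched by the commit.

```c
#include <assert.h>

/* Hypothetical macro that uses its parameter twice: once in the test
 * and once in the selected branch. */
#define CONT_PAYLOAD(b) ((((b) & 0xC0) == 0x80) ? ((b) & 0x3F) : 0xFF)

/* Passing *p++ directly: the increment runs once per mention of b. */
static int unsafe_next(const unsigned char **pp)
{
    const unsigned char *p = *pp;
    int r = CONT_PAYLOAD(*p++);
    *pp = p;
    return r;
}

/* Reading into a temporary first: exactly one increment. */
static int safe_next(const unsigned char **pp)
{
    const unsigned char *p = *pp;
    unsigned char b = *p++;
    int r = CONT_PAYLOAD(b);
    *pp = p;
    return r;
}

/* Small self-check on three continuation bytes. */
void check_macro_arg(void)
{
    const unsigned char s[] = { 0x81, 0x82, 0x83 };
    const unsigned char *p = s;

    assert(unsafe_next(&p) == 2);  /* payload of the second byte */
    assert(p == s + 2);            /* pointer advanced twice     */

    p = s;
    assert(safe_next(&p) == 1);    /* payload of the first byte  */
    assert(p == s + 1);            /* pointer advanced once      */
}
```

Moving the increment out of the macro argument, as in `safe_next`, gives the same result regardless of how many times a given platform's version of the macro evaluates its parameter.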
* utf8.c: Use macro instead of duplicating code | Karl Williamson | 2013-08-29 | 1 | -13/+13
| There is a macro that accomplishes this task, and is easier to read.
* utf8.c: Avoid unnecessary UTF-8 conversions | Karl Williamson | 2013-08-29 | 1 | -27/+62
| This changes the code so that converting to UTF-8 is avoided unless
| necessary. For such inputs, the conversion back from UTF-8 is also
| avoided. The cost of doing this is that the first swatches are
| combined into one that contains the values for all characters 0-255,
| instead of having multiple swatches. That means when first
| calculating the swatch it calculates all 256, instead of 128 (160 on
| EBCDIC).
|
| This also fixes an EBCDIC bug in which characters in this range were
| being translated twice.
* utf8.c: No need to check for UTF-8 malformations | Karl Williamson | 2013-08-29 | 1 | -5/+3
| This function assumes that the input is well-formed UTF-8, even
| though until this commit, the prefatory comments didn't say so. The
| API does not pass the buffer length, so there is no way it could
| check for reading off the end of the buffer. One code path already
| calls valid_utf8_to_uvchr(); this changes the remaining code path to
| correspond.
* utf8.c: Fix so UTF-16 to UTF-8 conversion works under EBCDIC | Karl Williamson | 2013-08-29 | 1 | -5/+9
|
* Fix valid_utf8_to_uvchr() for EBCDIC | Karl Williamson | 2013-08-29 | 1 | -2/+6
|
* Fix EBCDIC bugs in UTF8_ACCUMULATE and utf8.c | Karl Williamson | 2013-08-29 | 1 | -1/+1
|
* utf8.c: Use more clearly named macro | Karl Williamson | 2013-08-29 | 1 | -1/+1
| In the case of invariants these two macros should do the same thing,
| but it seems to me that the latter name more clearly indicates what
| is going on.
* Add macro OFFUNISKIP | Karl Williamson | 2013-08-29 | 1 | -3/+3
| This means use official Unicode code point numbering, not native.
| Doing this converts the existing UNISKIP calls in the code to refer
| to native code points, which is what they meant anyway. The
| terminology is somewhat ambiguous, but I don't think it will cause
| real confusion. NATIVE_SKIP is also introduced for situations where
| it is important to be precise.