path: root/utf8.c
Commit message  Author  Age  Files  Lines
* make more use of NOT_REACHEDLukas Mai2014-11-291-2/+2
| | | | In particular, remove all instances of 'assert(0);'.
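A minimal sketch of the idea behind this change, not Perl's actual macro: `assert(0)` compiles to nothing under `NDEBUG`, whereas a dedicated marker can still tell the compiler that control never gets there. `SKETCH_NOT_REACHED` and `sign_name` are hypothetical names, and the GCC/Clang builtin is an assumption about the compiler in use.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a NOT_REACHED-style macro: use the compiler's
 * unreachable hint where available, otherwise abort at run time. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_NOT_REACHED __builtin_unreachable()
#else
#  define SKETCH_NOT_REACHED abort()
#endif

/* Every possible value of `sign` returns from inside the switch, so the
 * code after the switch can never execute. */
static const char *sign_name(int n)
{
    int sign = (n > 0) - (n < 0);   /* always -1, 0, or 1 */

    switch (sign) {
    case -1: return "negative";
    case  0: return "zero";
    case  1: return "positive";
    }
    SKETCH_NOT_REACHED;  /* NOTREACHED */
}
```

The trailing `/* NOTREACHED */` comment is there for lint-style tools, while the macro carries the hint for the compiler.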
* Make is_invariant_string()Karl Williamson2014-11-261-6/+5
| | | | | | This is a more accurately named synonym for is_ascii_string(), which is retained. The old name is misleading to someone programming for non-ASCII platforms.
* Improve API pod of is_ascii_stringKarl Williamson2014-11-261-4/+8
|
* utf8.c: Shorten long constant names, and simplifyKarl Williamson2014-11-241-6/+10
| | | | | | | The previous commit fixed a typo caused by it being hard to see the differences in a long ALL_CAP name. This uses #defines to type the long name only once, and compile-time variables so that the expression for the length of the strings is specified only once.
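The pattern described here can be sketched as follows. The constant name and message text are made up for illustration and are not the ones used in utf8.c; the point is only that the long name is typed once and its length is computed in one place.

```c
#include <stddef.h>

/* Hypothetical long ALL_CAPS constant of the kind that invites typos. */
#define SKETCH_MALFORMED_UTF8_CHARACTER_MESSAGE "Malformed UTF-8 character"

/* Type the long name exactly once behind a short alias... */
#define MSG SKETCH_MALFORMED_UTF8_CHARACTER_MESSAGE

/* ...and derive the text and its length from the same alias, so that
 * sizeof() can never be applied to a different, similar-looking name. */
static const char   sketch_msg[]   = MSG;
static const size_t sketch_msg_len = sizeof(MSG) - 1;  /* exclude the NUL */

#undef MSG
```

Because the length is taken from the same token as the string, it stays correct on any platform, whatever the character encoding.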
* utf8.c: Was taking sizeof() wrong thingKarl Williamson2014-11-241-1/+1
| | | | | | This was a typo due to the long name. A future commit will make it cleaner. The sizeof() of the wrong name evaluates to the right number on ASCII platforms, but not on EBCDIC.
* Add warning message for locale/Unicode intermixingKarl Williamson2014-11-141-5/+21
| | | | This is explained in the added perldiag entry.
* uvoffuni_to_utf8_flags() die if platform can't handleKarl Williamson2014-10-211-0/+9
| | | | | | | | | | | | | | | | | On non EBCDIC platforms currently any UV is encodable as UTF-8. (This would change if there were 128-bit words). Thus, much code assumes that nothing can go wrong when converting to UTF-8, and hence does no error checking. However, UTF-EBCDIC is only capable of representing code points below 2**32, so if there are 64-bit words, this function can fail. Prior to this patch, there was no real overflow check, and garbage was returned by this function if called with too large a number. While not ideal, the easiest thing to do is to just die for such a number, like we do for division by 0. This involves changing only code within this function, and not its many callers.
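A rough sketch of the approach, assuming a hypothetical platform limit just below 2**32: the range check lives inside the encoding function, so callers that do no error checking stay unchanged. The names `SKETCH_MAX_ENCODABLE`, `sketch_is_encodable`, and `sketch_encode_or_die` are invented here, and the encoding step itself is only a placeholder.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed limit for an encoding that cannot represent 2**32 or above. */
#define SKETCH_MAX_ENCODABLE UINT64_C(0xFFFFFFFF)

/* Nonzero if cp fits under the assumed limit. */
static int sketch_is_encodable(uint64_t cp)
{
    return cp <= SKETCH_MAX_ENCODABLE;
}

/* Checks the range once, here, so that callers need no error handling of
 * their own; an out-of-range value terminates the program. */
static void sketch_encode_or_die(uint64_t cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (!sketch_is_encodable(cp)) {
        fprintf(stderr, "code point 0x%llx is too large to encode\n",
                (unsigned long long)cp);
        exit(EXIT_FAILURE);
    }
    /* Placeholder for the real encoding step: store the low byte only. */
    buf[0] = (unsigned char)(cp & 0xFF);
}
```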
* utf8.c: Improve debug messageKarl Williamson2014-10-211-2/+2
| | | | | | This function was called with an empty string "" because that string was not actually needed in the function, except to better identify the source when there is an error. So change to specify the actual source.
* utf8.c: Move an #ifndef for clarityFather Chrysostomos2014-09-121-1/+1
| | | | | The comment really belongs inside it, as it refers to those two lines of code.
* Remove obsolete comment from utf8.cFather Chrysostomos2014-09-121-8/+0
| | | | | | | | | | The call to save_re_context was removed by the previous commit. The commit before that stopped save_re_context from doing anything. Commit db2c6cb33 stopped the errsv_save line from triggering get-magic. So this comment, added in dc0c6abb4, no longer applies.
* Don’t call save_re_contextFather Chrysostomos2014-09-121-1/+4
| | | | It is an empty function.
* perl #122747: localize PL_curpm to null in _core_swash_initYves Orton2014-09-111-2/+17
| | | | | | | | | | | | Set PL_curpm to null before we do any swash initialization in _core_swash_init(). This "hides" the current regop from the swash code, with the intent of preventing weird reentrancy bugs when the swashes are initialized. Long term you could argue that we should just not use the regex engine to initialize a swash, and then this would be unnecessary. Thanks to FC for the suggestion!
* utf8.c: Use slightly more efficient macroKarl Williamson2014-07-251-2/+4
| | | | | | Lowercasing a Latin-1 range character results in a Latin-1 range character, so we can use the more restrictive macro, which is slightly more efficient than the general ones. (This difference applies only on EBCDIC platforms, as the macros all expand to nothing on ASCII ones.)
* Use grok_atou instead of strtoul (no explicit strtol uses).Jarkko Hietaniemi2014-07-221-7/+10
|
* Remove or downgrade unnecessary dVAR.Jarkko Hietaniemi2014-06-251-35/+0
| | | | | | | | You need to configure with g++ *and* -Accflags=-DPERL_GLOBAL_STRUCT or -Accflags=-DPERL_GLOBAL_STRUCT_PRIVATE to see any difference. (g++ does not do the "post-annotation" form of "unused".) The version code has some of these issues, reported upstream.
* PERL_UNUSED_CONTEXT -> remove interp context where possibleDaniel Dragan2014-06-241-3/+1
| | | | | | | | | | | | | | | | | | | | | Removing context params will save machine code in the callers of these functions, and 1 ptr of stack space. Some of these funcs are heavily used, such as mg_find*. The contexts can always be readded in the future the same way they were removed. This patch inspired by commit dc3bf40570. Also remove PERL_UNUSED_CONTEXT when it's not needed. See removal candidate rejection rationale in [perl #122106]. -Perl_hv_backreferences_p uses context in S_hv_auxinit commit 96a5add60f was wrong -Perl_whichsig_sv and Perl_whichsig_pv wrongly used PERL_UNUSED_CONTEXT from inception in commit 84c7b88cca -in the author's opinion cast_* shouldn't be public API, no CPAN grep usage, can't be static and/or inline optimized since it is exported -Perl_my_unexec move to block where it is needed, make Win32 block, context free, for inlining likelihood, private api and only 2 callers in core -Perl_my_dirfd make all blocks context free, then change proto -Perl_bytes_cmp_utf8 wrongly used PERL_UNUSED_CONTEXT from inception in commit fed3ba5d6b
* Silence -Wunused-parameter my_perl under threads.Jarkko Hietaniemi2014-06-191-3/+4
| | | | | | | | | | | | | | For S_ functions, remove the context. For Perl_ functions, add PERL_UNUSED_CONTEXT. Tricky because sometimes depends on DEBUGGING, and sometimes on whether we have PERL_IMPLICIT_SYS. (Why all the mathoms Perl_is_uni_... and Perl_is_utf8_... functions are not being whined about is a mystery.) vutil.c (included via util.c) has one of these, but it's cpan/, and a known problem: https://rt.cpan.org/Ticket/Display.html?id=96100
* Revert "/* NOTREACHED */ belongs *before* the unreachable."Jarkko Hietaniemi2014-06-191-4/+2
| | | | | | This reverts commit 148f39b7de6eae9ddd59e0b0aff691d6abea7aca. (Still needs more work, but wanted to see how well this passed with Jenkins.)
* /* NOTREACHED */ belongs *before* the unreachable.Jarkko Hietaniemi2014-06-191-2/+4
| | | | | | Definitely not *after* it. It marks the start of the unreachable, not the first unreachable line. And if they are in that order, it looks better to linebreak after the lint hint.
* Some low-hanging -Wunreachable-code fruits.Jarkko Hietaniemi2014-06-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | - after return/croak/die/exit, return/break are pointless (break is not a terminator/separator, it's a goto) - after goto, another goto (!) is pointless - in some cases (usually function ends) introduce explicit NOT_REACHED to make the noreturn nature clearer (do not do this everywhere, though, since that would mean adding NOT_REACHED after every croak) - for the added NOT_REACHED also add /* NOTREACHED */ since NOT_REACHED is for gcc (and VC), while the comment is for linters - declaring variables in switch blocks is just too fragile: it kind of works for narrowing the scope (which is nice), but breaks the moment there are initializations for the variables (the initializations will be skipped since the flow will bypass the start of the block); in some easy cases simply hoist the declarations out of the block and move them earlier Note 1: Since after this patch the core is not yet -Wunreachable-code clean, not enabling that via cflags.SH, one needs to -Accflags=... it. Note 2: At least with the older gcc 4.4.7 there are far too many "unreachable code" warnings, which seem to go away with gcc 4.8, maybe better flow control analysis. Therefore, the warning should eventually be enabled only for modernish gccs (what about clang and Intel cc?)
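The last bullet above, about declarations inside switch blocks, can be illustrated with a small sketch. The function `sketch_scaled` is hypothetical; it shows the hoisted form and explains in a comment what goes wrong with the fragile form.

```c
/* Multiplies value by a factor selected by kind. */
static int sketch_scaled(int kind, int value)
{
    int factor = 1;   /* hoisted: the initializer runs before the switch */

    switch (kind) {
        /* A declaration such as `int factor = 1;` written here, before the
         * first case label, would be in scope below, but its initializer
         * would never run because control jumps straight to a label. */
    case 1:
        factor = 10;
        break;
    case 2:
        factor = 100;
        break;
    default:
        break;
    }
    return value * factor;
}
```

Hoisting costs a slightly wider scope for the variable, but the default path then sees a defined value instead of an indeterminate one.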
* rmv duplicate SvUV call in Perl__swash_inversion_hashDarin McBride2014-06-141-3/+5
|
* Revert "Some low-hanging -Wunreachable-code fruits."Jarkko Hietaniemi2014-06-131-1/+1
| | | | | | | This reverts commit 8c2b19724d117cecfa186d044abdbf766372c679. I don't understand - smoke-me came back happy with three separate reports... oh well, some other time.
* Some low-hanging -Wunreachable-code fruits.Jarkko Hietaniemi2014-06-131-1/+1
| | | | | | | | | | | | | | | | | | - after croak/die/exit (or return), break (or return!) are pointless (break is not a terminator/separator, it's a promise of a jump) - after goto, another goto (!) is pointless - in some cases (usually function ends) introduce explicit NOT_REACHED to make the noreturn nature clearer (do not do this everywhere, though, since that would mean adding NOT_REACHED after every croak) - for the added NOT_REACHED also add /* NOTREACHED */ since NOT_REACHED is for gcc (and VC), while the comment is for linters - declaring variables in switch blocks is just too fragile: it kind of works for narrowing the scope (which is nice), but breaks the moment there are initializations for the variables (they will be skipped!); in some easy cases simply hoist the declarations out of the block and move them earlier There are still a few places left.
* perlapi: Include general informationKarl Williamson2014-06-051-2/+1
| | | | | | | | | | | Unlike other pod handling routines, autodoc requires the line following an =head1 to be non-empty for its text to be included in the paragraph started by the heading. If you fail to do this, silently the text will be omitted from perlapi. I went through the source code, and where it was apparent that the text was supposed to be in perlapi, deleted the empty line so it would be, with some revisions to make more sense. I added =cuts where I thought it best for the text to not be included.
* Move some deprecated utf8-handling functions to mathomsKarl Williamson2014-05-311-136/+17
| | | | | This entailed creating new internal functions for some of them to call so that the functionality can be retained during the deprecation period.
* Make is_utf8_char_buf() a macroKarl Williamson2014-05-311-1/+1
| | | | | | This function is now more efficiently implemented as a synonym for isUTF8_CHAR(). We retain the Perl_is_utf8_char_buf() function for code that calls it that way.
* Create isUTF8_CHAR() macro and use itKarl Williamson2014-05-311-68/+15
| | | | | | | | | | | | | | | | | | This macro will inline the code to determine if a character is well-formed UTF-8 for code points below a certain value, falling back to a slower function for larger ones. On ASCII platforms, it will inline for well-beyond all legal Unicode code points. On EBCDIC, it currently does it for code points up to 0x3FFF. This could be increased, but our porting tests do the regen every time to make sure everything is ok, and making it larger slows that down. This is worked around on ASCII by normally commenting out the code that generates this info, but including in utf8.h a version that did get generated. This is static information and won't change. (This could be done for EBCDIC too, but I chose not to at this time as each code page has a different macro generated, and it gets ugly getting all of them in utf8.h) Using this macro allowed for simplification of several functions in utf8.c
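The inline-then-fall-back structure described in this message might look roughly like the sketch below. It is not the generated isUTF8_CHAR() macro: the names are hypothetical, the fast path covers only one- and two-byte sequences, and the slow path accepts only standard UTF-8 up to U+10FFFF (it does not reject surrogates and knows nothing of Perl's extended encoding or EBCDIC).

```c
#include <stddef.h>

/* Slow path: three- and four-byte sequences, rejecting overlong forms and
 * anything above U+10FFFF. Returns the sequence length, or 0 if malformed. */
static size_t sketch_utf8_len_slow(const unsigned char *s,
                                   const unsigned char *e)
{
    size_t need, i;
    unsigned long cp;

    if (s[0] >= 0xE0 && s[0] <= 0xEF)      { need = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { need = 4; cp = s[0] & 0x07; }
    else return 0;  /* stray continuation, bad start byte, or truncated */

    if ((size_t)(e - s) < need) return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (need == 3 && cp < 0x800) return 0;                        /* overlong */
    if (need == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    return need;
}

/* Fast path: the common short cases are decided inline; everything else
 * goes to the slower function. */
static inline size_t sketch_utf8_len(const unsigned char *s,
                                     const unsigned char *e)
{
    if (s >= e) return 0;
    if (s[0] < 0x80) return 1;                              /* ASCII */
    if (s[0] >= 0xC2 && s[0] <= 0xDF
        && e - s >= 2 && (s[1] & 0xC0) == 0x80) return 2;   /* two bytes */
    return sketch_utf8_len_slow(s, e);
}
```

Where to draw the line between the inline part and the fallback is a size/speed trade-off; the commit message notes that the real macro inlines far more on ASCII platforms than this sketch does.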
* utf8.c: Move a static function to inline.hKarl Williamson2014-05-311-35/+3
| | | | | This is in preparation for it being called from outside utf8.c. It is renamed to have a leading underscore to emphasize its private nature
* utf8.c: Move documentation next to its functionKarl Williamson2014-05-301-16/+16
| | | | Somehow this pod stuff was orphaned from the function it describes.
* utf8.c: Silence compiler warningKarl Williamson2014-05-291-1/+1
| | | | | | | | | This was brought to my attention by Jarkko Hietaniemi. The compiler was complaining that a variable could be used uninitialized. In practice this doesn't happen, as it would only happen on bad data, and Perl itself generates the data used. (I suppose if the data got corrupted, it could happen.) This commit initializes the value unconditionally, which allows a conditional setting of it to be removed.
* utf8.c: Move static function to embed.fncKarl Williamson2014-05-291-6/+8
| | | | This automatically generates assertions for pointer arguments, etc.
* UV casting to avoid intermediate sign extension.Jarkko Hietaniemi2014-05-291-7/+10
| | | | | | | | | [perl #121746] Fix for Coverity perl5 CIDs 29069, 29070, 29071: Unintended sign extension: ... ... if ... U8 (8 bits unsigned) ... 32 bits, signed ... 64 bits, unsigned ... is greater than 0x7FFFFFFF, the upper bits of the result will all be 1.
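The class of defect reported here can be sketched with a hypothetical big-endian loader (`sketch_load_be32` is not a function from utf8.c). Casting each byte to the wide unsigned type before shifting keeps the whole computation unsigned.

```c
#include <stdint.h>

/* Assembles four bytes into a 64-bit unsigned value. */
static uint64_t sketch_load_be32(const uint8_t *s)
{
    /* Without the casts, `s[0] << 24` is evaluated as a signed int. For
     * s[0] >= 0x80 the result does not fit in a 32-bit int, and widening it
     * to a 64-bit unsigned type typically sets all of the upper 32 bits. */
    return ((uint64_t)s[0] << 24)
         | ((uint64_t)s[1] << 16)
         | ((uint64_t)s[2] <<  8)
         |  (uint64_t)s[3];
}
```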
* slen may be uninitialized.Jarkko Hietaniemi2014-05-281-1/+1
| | | | | | | | Fix for Coverity perl5 CID 29081: Uninitialized scalar variable (UNINIT) uninit_use_in_call: Using uninitialized value slen when calling Perl_croak. If all fails, slen hasn't been set, and croak will be called with that.
* perlapi: Give more accurate value for needed free spaceKarl Williamson2014-05-121-6/+6
| | | | | | When converting to UTF-8, one usually doesn't need 14 bytes available space, which is what was previously claimed. It actually depends on the value being converted. This change gives the precise value.
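To illustrate how the space needed depends on the value, here is a minimal sketch limited to standard UTF-8 (code points up to U+10FFFF). The function name is hypothetical, and larger values, which Perl's extended encoding can represent with longer sequences, are deliberately outside its scope.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes standard UTF-8 needs for cp, or 0 if cp is above
 * U+10FFFF and therefore outside what this sketch covers. */
static size_t sketch_utf8_bytes_needed(uint32_t cp)
{
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}
```

A caller that knows the value in advance can size its buffer from this result rather than reserving the worst case.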
* perlapi: Clarify NUL handling for 2 fcns; nitsKarl Williamson2014-04-231-8/+10
| | | | | The string input to these two functions must be NUL terminated when the length parameter is 0.
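The convention described here, where a length of 0 means "measure the string yourself", can be sketched as below. The commit does not name the two functions, so `sketch_is_ascii` is a hypothetical example of the convention rather than either of them.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if every byte is below 0x80, else 0. A len of 0 means the
 * length is taken with strlen(), so in that case the caller must pass a
 * NUL-terminated string. */
static int sketch_is_ascii(const unsigned char *s, size_t len)
{
    size_t i;

    if (len == 0)
        len = strlen((const char *)s);
    for (i = 0; i < len; i++) {
        if (s[i] >= 0x80)
            return 0;
    }
    return 1;
}
```

One consequence of this convention is that a genuinely empty buffer that is not NUL-terminated cannot be passed safely with a length of 0.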
* White-space only; properly indent newly formed blocksKarl Williamson2014-03-141-13/+13
| | | | | The previous commit added braces forming blocks. This indents the contents of those blocks.
* mktables: Inline short tablesKarl Williamson2014-03-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mktables generates tables of Unicode properties. These are stored in files to be loaded on-demand. This is because the memory cost of having all of them loaded would be excessive, and many are rarely used. Hashes created in Heavy.pl, which is read in by utf8_heavy.pl, map the Unicode property name to the file which contains its definition. It turns out that nearly half of current Unicode properties are just a single consecutive range of code points, and their definitions are representable almost as compactly as the name of the files that contain them. This commit changes mktables so that the tables for single-range properties are not written out to disk, but instead a special syntax is used in Heavy.pl to indicate this and what their definitions are. This does not increase the memory usage of Heavy.pl appreciably, as the definitions replace the file names that are already there, but it lowers the number of files generated by mktables from 908 (in Unicode 6.3) to 507. These files are probably each a disk block, so the disk savings is not large. But it means that reading in any of these properties is much faster, as once utf8_heavy gets loaded, no further disk access is needed to get any of these properties. Most of these properties are obscure, but not all. The Line and Paragraph separators, for example, are quite commonly used. Further, utf8_heavy.pl caches the files it has read in into hashes. This is not necessary for these, as they are already in memory, so the total memory usage goes down if a program uses any of these, but again, since these are small, that amount is not large. The major gain is not having to read these files from disk at run time. Tables that match no code points at all are also represented using this mechanism. Previously, they were expressed as the complements of \p{All}, which matches everything possible.
* Change av_len calls to av_tindex for clarityKarl Williamson2014-02-201-4/+4
| | | | | | av_tindex is a more clearly named synonym for av_len, available starting in v5.18. This changes the core uses to it, including modules in /ext, which are not dual-lifed.
* White-space, comments onlyKarl Williamson2014-01-271-0/+1
| | | | | | | This mostly indents and outdents based on blocks added or removed by the previous commit. But there are a few comment changes and vertical alignment of macro backslash continuation characters, and other white-space changes
* Work properly under UTF-8 LC_CTYPE localesKarl Williamson2014-01-271-24/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This large (sorry, I couldn't figure out how to meaningfully split it up) commit causes Perl to fully support LC_CTYPE operations (case changing, character classification) in UTF-8 locales. As a side effect it resolves [perl #56820]. The basics are easy, but there were a lot of details, and one troublesome edge case discussed below. What essentially happens is that when the locale is changed to a UTF-8 one, a global variable is set TRUE (FALSE when changed to a non-UTF-8 locale). Within the scope of 'use locale', this variable is checked, and if TRUE, the code that Perl uses for non-locale behavior is used instead of the code for locale behavior. Since Perl's internal representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale. More work had to be done for regular expressions. There are three cases. 1) The character classes \w, [[:punct:]] needed no extra work, as the changes fall out from the base work. 2) Strings that are to be matched case-insensitively. These form EXACTFL regops (nodes). Notice that if such a string contains only characters above-Latin1 that match only themselves, that the node can be downgraded to an EXACT-only node, which presents better optimization possibilities, as we now have a fixed string known at compile time to be required to be in the target string to match. Similarly if all characters in the string match only other above-Latin1 characters case-insensitively, the node can be downgraded to a regular EXACTFU node (match, folding, using Unicode, not locale, rules). The code changes for this could be done without accepting UTF-8 locales fully, but there were edge cases which needed to be handled differently if I stopped there, so I continued on. 
In an EXACTFL node, all such characters are now folded at compile time (just as before this commit), while the other characters whose folds are locale-dependent are left unfolded. This means that they have to be folded at execution time based on the locale in effect at the moment. Again, this isn't a change from before. The difference is that now some of the folds that need to be done at execution time (in regexec) are potentially multi-char. Some of the code in regexec was trivial to extend to account for this because of existing infrastructure, but the part dealing with regex quantifiers had to have more work. Also the code that joins EXACTish nodes together had to be expanded to account for the possibility of multi-character folds within locale handling. This was fairly easy, because it already has infrastructure to handle these under somewhat different circumstances. 3) In bracketed character classes, represented by ANYOF nodes, a new inversion list was created giving the characters that should be matched by this node when the runtime locale is UTF-8. The list is ignored except under that circumstance. To do this, I created a new ANYOF type which has an extra SV for the inversion list. The edge case that caused the most difficulty is folding involving the MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range character that folds to outside that range. The issue is that it doesn't naturally fall out that it will match the CAP MU. If we let the CAP MU fold to the small mu at compile time (which it can because both are above-Latin1 and so the fold is the same no matter what locale is in effect), it could appear that the regnode can be downgraded away from EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case-insensitively match the CAP MU. This could be special cased in regcomp and regexec, but I wanted to avoid that. 
Instead the mktables tables are set up to include the CAP MU as a character whose presence forbids the downgrading, so the special casing is in mktables, and not in the C code.
* Rename an internal flagKarl Williamson2014-01-271-4/+4
| | | | | The UTF8 in the name is kind of misleading, and would be more misleading after future commits make UTF8 locales special.
* Taint more operands with case changesKarl Williamson2014-01-271-28/+9
| | | | | | | | | | The documentation says that Perl taints certain operations when subject to locale rules, such as lc() and ucfirst(). Prior to this commit there were exceptions when the operand to these functions contained no characters whose case change actually varied depending on the locale, for example the empty string or above-Latin1 code points. Changing to conform to the documentation simplifies the core code, and yields more consistent results.
* Comments, white-spaceKarl Williamson2014-01-221-4/+7
| | | | | | | | This adds and modifies various comments in several files, rewrapping some comments to occupy fewer lines but not exceed 79 columns. And fixes some indentation and other white space issues. It includes removing trailing white space in lines in regcomp.c. I didn't think it was worth making a commit for each file.
* Turn on read-only flag for some unchangeable inversion listsKarl Williamson2014-01-161-0/+3
| | | | | | These lists are read-only. Turning on the flag may allow some optimisations to be done, including some that may be added in the future.
* IDStart and IDCont no longer go out to diskKarl Williamson2014-01-091-2/+11
| | | | | | | These are the base names for various macros used in parsing identifiers. Prior to this patch, parsing a code point above Latin1 caused loading disk files. This patch causes all the information to be compiled into the Perl binary.
* isWORDCHAR_uni(), isDIGIT_utf8() etc no longer go out to diskKarl Williamson2014-01-091-11/+22
| | | | | | | Previous commits in this series have caused all the POSIX classes to be completely specified at C compile time. This allows us to revise the base function used by all these macros to use these definitions, avoiding reading them in from disk.
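As a sketch of how a property can be answered from data compiled into the binary, the example below uses an inversion list: a sorted array of code points at which membership flips, searched with a binary search. The array layout, the names, and the ASCII-digits-only table are assumptions for illustration and do not reproduce Perl's actual inversion list representation.

```c
#include <stddef.h>
#include <stdint.h>

/* Static table: membership starts at 0x30 ('0') and stops at 0x3A, the
 * code point just after '9'. */
static const uint32_t sketch_digit_invlist[] = { 0x30, 0x3A };

/* Counts how many list entries are <= cp using binary search. An odd
 * count means cp lies inside one of the ranges. */
static int sketch_invlist_contains(const uint32_t *list, size_t n,
                                   uint32_t cp)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1) != 0;
}
```

Since the table is a constant in the program image, a lookup involves no file access at all, which is the gain the commit describes.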
* utf8.c: Add commentKarl Williamson2014-01-091-1/+3
|
* utf8.c: Move a bunch of deprecated fcns to mathoms.cKarl Williamson2014-01-051-400/+0
| | | | | These functions will be out of the way in mathoms. There were a few that could not be moved, as-is, so I left them.
* utf8.c: Use existing macros instead of duplicate codeKarl Williamson2014-01-051-101/+38
| | | | | | In all these cases, there is an already existing macro that does exactly the same thing as the code that this commit replaces. No sense duplicating logic.
* Change some warnings in utf8n_to_uvchr()Karl Williamson2014-01-011-26/+26
| | | | | | | | | | | | | | This bottom level function decodes the first character of a UTF-8 string into a code point. Using it directly is discouraged. This commit cleans up some of the warnings it can raise. Now, tests for malformations are done before any tests for other potential issues. One of those issues involves code points so large that they have never appeared in any official standard (the current standard has scaled back the highest acceptable code point from earlier versions). It is possible (though not done in CPAN) to warn and/or forbid these code points, while accepting smaller code points that are still above the legal Unicode maximum. The warning message for this now includes the code point if representable on the machine. Previously it always displayed raw bytes, which is what it still does for non-representable code points.