path: root/utf8.h
* Use new case changing macros (Karl Williamson, 2013-05-20; 1 file, -2/+2)
  The previous commit added macros to do some case changing. This commit uses them in the core, where appropriate.
* Add, fix comments (Karl Williamson, 2013-02-25; 1 file, -2/+5)
* utf8.h, utfebcdic.h: Add, fix comments (Karl Williamson, 2013-02-15; 1 file, -0/+6)
* utf8.h: Add comments (Karl Williamson, 2013-01-16; 1 file, -8/+41)
  This also reorders one #define to be closer to a related one.
* Create deprecated fncs to replace to-be-removed macros (Karl Williamson, 2013-01-12; 1 file, -3/+0)
  These macros should not be used, as they are prone to misuse, and there are no occurrences of them in CPAN. The single use of either of them in core was recently removed (commit 8d40577bdbdfa85ed3293f84bf26a313b1b92f55) because it was a misuse. Code should instead use isIDFIRST_lazy_if or isWORDCHAR_lazy_if (isALNUM_lazy_if is also available, but can be confused with the Posix alnum, which it doesn't mean).
* Add isWORDCHAR_lazy_if() macro (Karl Williamson, 2012-12-31; 1 file, -3/+4)
  This is a synonym for the existing isALNUM_lazy_if(), which can be confused with meaning the Posix alnum instead of the Perl \w.
* utf8.h: Make sure char* is cast to U8* for unsigned comparison (Karl Williamson, 2012-12-23; 1 file, -4/+5)
  Prior to this commit, if a char* was passed, an above-ASCII character could be treated as negative instead of positive, invalidating these tests.
* utf8.h: Parenthesize macro parameter (Karl Williamson, 2012-12-23; 1 file, -1/+1)
  This apparently hasn't caused us problems, but all uses of a macro parameter should be parenthesized to prevent surprises.
* make regcharclass generate submacros if necessary to keep them short (Yves Orton, 2012-12-06; 1 file, -9/+8)
  Some compilers can't handle unexpanded macros longer than something like 8000 characters, so we split long ones into sub-macros to work around the problem.
* utf8.h: Add macro that handles malformed 2-byte UTF-8 (Karl Williamson, 2012-11-11; 1 file, -1/+9)
  The macro traditionally used to see if there is a two-byte UTF-8 sequence doesn't make sure that a second byte is actually available; it only checks whether the first byte indicates that there is one. This adds a macro that is safe in the face of malformed UTF-8. I inspected the existing calls in the core to the unsafe macro, and I believe that none of them need to be converted to the safe version.
* utf8.h: Add guard against recursive #include (Karl Williamson, 2012-10-16; 1 file, -0/+5)
  A future commit will #include this from another header.
* utf8.h: Correct some values for EBCDIC (Karl Williamson, 2012-10-14; 1 file, -15/+19)
  It occurred to me that EBCDIC has different maximums for the number of bytes a character can occupy. This moves the definition in utf8.h to within an #ifndef EBCDIC, and adds the correct values to utfebcdic.h.
* utf8.h: Add macro to test if UTF8 code point isn't Latin1 (Karl Williamson, 2012-09-16; 1 file, -0/+1)
* utf8.h: Use machine generated IS_UTF8_CHAR() (Karl Williamson, 2012-09-13; 1 file, -95/+47)
  This takes the output of regen/regcharclass.pl for all the 1-4 byte UTF-8 representations of Unicode code points, and replaces the current hand-rolled definition with it. It does this only for ASCII platforms, leaving the EBCDIC version to be machine generated when run on such a platform. I would rather have both versions regenerated each time they are needed, to avoid an EBCDIC dependency, but it takes more than 10 minutes on my computer to process the 2 billion code points that have to be checked for on ASCII platforms, and t/porting/regen.t currently runs this program every time; that slowdown would be unacceptable. If this is ever run under EBCDIC, the macro will be machine computed (very slowly). So, even though there is an EBCDIC dependency, it has essentially been solved.
* utf8.h: Remove some EBCDIC dependencies (Karl Williamson, 2012-09-13; 1 file, -82/+11)
  regen/regcharclass.pl has been enhanced in previous commits so that it generates code as good as these hand-defined macro definitions for various UTF-8 constructs, and it should be able to generate EBCDIC versions as well. By using its definitions, we can remove the EBCDIC dependencies for them. It is quite possible that the EBCDIC versions were wrong, since they have never been tested. Even if regcharclass.pl has bugs under EBCDIC, it is easier to find and fix those in one place than in all the sundry definitions.
* utf8.h: Save a branch in a macro (Karl Williamson, 2012-09-13; 1 file, -1/+1)
  By adding a mask, we can save a branch. The two expressions match the exact same code points.
* utf8.h: White-space only (Karl Williamson, 2012-09-13; 1 file, -6/+8)
  This reflows some lines to fit into 80 columns.
* utf8.h: Correct improper EBCDIC conversion (Karl Williamson, 2012-09-13; 1 file, -4/+8)
  These macros were incorrect for EBCDIC. The relationships are based on I8, the intermediate UTF-8 defined for UTF-EBCDIC, not the final encoding. I was the culprit who did this originally; I was confused by the names of the conversion macros. I'm adding names that are clearer to me; they have already been defined in utfebcdic.h, but weren't defined for non-EBCDIC platforms.
* Remove some EBCDIC dependencies (Karl Williamson, 2012-09-13; 1 file, -8/+10)
  A new regen'd header file has been created that contains the native values for certain characters. By using those macros, we can eliminate EBCDIC dependencies.
* utf8.c: Prefer binsearch over swash hash for small swashes (Karl Williamson, 2012-08-25; 1 file, -0/+1)
  A binary swash is a hash of bitmaps used to cache the results of looking up whether a code point matches a Unicode property or regex bracketed character class. An inversion list is a data structure that also holds information about which code points match a Unicode property or character class. It is implemented as an SV* pointing to a sorted C array, and hence can be searched using a binary search.

  This patch converts to using a binary search of an inversion list instead of a hash look-up for inversion lists that are no more than 512 elements (9 iterations of the search loop). That number can easily be adjusted, if necessary.

  Theoretically, a hash is faster than a binary search over a very long run, so this may negatively impact long-running servers. But in the short run, where most programs reside, the binary search is significantly faster. A swash is filled as needed over time, caching each new distinct code point it is called with. If it is called with many, many such code points, its performance can degrade as collisions increase. A binary search does not have that drawback. However, most real-world scenarios do not have a program being called with huge numbers of distinct code points; mostly, a program will be called with code points from just one or a few of the world's scripts, so the swash will remain sparse.

  The bitmaps in a swash are each 64 bits long (except for ASCII, where they are 128). That means that when the swash is populated, a lookup of a single code point that hasn't been checked before also has to look up the 63 adjoining code points, increasing its startup overhead. Of course, if one of those 63 code points is later accessed, no extra populate happens. This is the typical case, where a language's code points are all near each other.

  The bottom line, though, is that in the short term this patch speeds up the processing of \X regex matching by about 35-40%, with modern Korean (which has uniquely complicated \X processing) closer to 40%, and other scripts closer to 35%. The 512 boundary means that over 90% of the official Unicode properties are handled using binary search. I settled on that number by experimenting with several properties besides \X and with various powers-of-2 limits. Until I got that high, performance kept improving when a property went from being a swash to a binary search. \X improved even up to 2048, which encompasses 100% of the official Unicode properties.

  The implementation changes so that an inversion list instead of a swash is returned by swash_init() when the input flags allow it to do so, for all inversion lists shorter than the compiled-in constant of 512 (actually <= 512). The other functions that access swashes have added intelligence to deal with an object of either type. Should someone on CPAN be using the public swash_init() interface, they will not see any difference, as the option to get an inversion list is not available to them.
* utf8.c: collapse a function parameter (Karl Williamson, 2012-08-25; 1 file, -0/+1)
  Now that we have a flags parameter, we can make this parameter just another flag, giving a cleaner interface to this internal-only function. This also renames the flag parameter to <flag_p> to indicate it needs to be dereferenced.
* embed.fnc: Turn null wrapper function into macro (Karl Williamson, 2012-08-25; 1 file, -0/+1)
  This function only does something on EBCDIC platforms. On ASCII ones, make it a macro, like similar ones, to avoid useless function nesting.
* utf8.c: Revise internal API of swash_init() (Karl Williamson, 2012-08-25; 1 file, -0/+3)
  This revises the API for the version of swash_init() that is usable by core Perl. The external interface is unaffected. There is now a flags parameter to allow for future growth. And the core internal-only function that returns whether a swash has a user-defined property in it has been removed; that information is now returned via the new flags parameter upon initialization, and is unavailable afterwards. This is to prepare for the flexibility to change the swash that is needed in future commits.
* utf8.h, regcomp.c: Use mnemonics for Unicode chars (Karl Williamson, 2012-07-24; 1 file, -0/+3)
  Add a mnemonic definition for these three code points. They are currently used in only one place, but future commits will use them elsewhere.
* update the editor hints for spaces, not tabs (Ricardo Signes, 2012-05-29; 1 file, -2/+2)
  This updates the editor hints in our files for Emacs and vim to request that tabs be inserted as spaces.
* utf8.c: Add nomix-ASCII option to to_fold functions (Karl Williamson, 2012-05-22; 1 file, -1/+2)
  Under /iaa regex matching, folds that cross the ASCII/non-ASCII boundary are prohibited. This changes the _to_uni_fold_flags() and _to_utf8_fold_flags() functions to take a new flag which, when set, tells them not to accept such folds. This allows us to later move the intelligence for handling this situation into these centralized functions.
* utf8.h, pp.c: Add UTF8_IS_REPLACEMENT macro, and use it (Karl Williamson, 2012-05-22; 1 file, -0/+10)
  This should speed things up slightly, as it looks directly at the UTF-8 source, instead of having to decode it first.
* utf8.h: Simplify expressions (Karl Williamson, 2012-05-22; 1 file, -28/+8)
  These expressions, while valid, are overly complicated, in order to make it easy to separate out problematic code points, such as surrogates, in the future. But we made a decision in 5.12 not to go in that direction, and instead to accept such problematic code points in general. I haven't heard any cause to regret that decision; if we ever want to go back, the blame log will easily allow us to.
* utf8.h: Comment improvements, white-space (Karl Williamson, 2012-05-22; 1 file, -17/+36)
* utf8.c: refactor utf8n_to_uvuni() (Karl Williamson, 2012-04-26; 1 file, -0/+6)
  The prior version had a number of issues, some of which have been taken care of in previous commits. The goal, when presented with malformed input, is to consume as few bytes as possible, so as to position the input for the next try at the first possible byte that could begin a character. We don't want to consume too few bytes, so that the next call has us thinking that the middle of a character is really the beginning; nor do we want to consume too many, so as to skip valid input characters. (The latter is forbidden by the Unicode standard for security reasons.) The previous code could do both of these under various circumstances.

  In some cases it took as a given that the first byte in a character is correct, and skipped looking at the rest of the bytes in the sequence. This is wrong when just that first byte is garbled. We have to look at all bytes in the expected sequence to make sure it hasn't been prematurely terminated relative to what the first byte led us to expect. Likewise when we get an overflow: we have to keep looking at each byte in the sequence. It may be that the initial byte was garbled, so that it appeared there was going to be overflow, but in reality the input was supposed to be a shorter sequence that doesn't overflow. We want to raise an error on that shorter sequence, and advance the pointer to just beyond it, which is the first position where a valid character could start. This fixes a long-standing TODO from an externally supplied utf8 decode test suite.

  Also, the old algorithm for finding overflow failed to detect it on some inputs. This was spotted by Hugo van der Sanden, who suggested the new algorithm that this commit uses, and which should work in all instances. For example, on a 32-bit machine, any string beginning with "\xFE" whose next byte is either "\x86" or "\x87" overflows, but this was missed by the old algorithm.

  Another bug was that the code was careless about what happens when a malformation occurs that the input flags allow. For example, a sequence should not start with a continuation byte. If that malformation is allowed, the code pretended the continuation byte is a start byte and extracted the "length" of the sequence from it. But pretending it is a start byte is not the same thing as it actually being a start byte; there is no extractable length in it, so the number this code thought was the length was bogus.

  Yet another bug fixed is that if only the warning subcategories of the utf8 category were turned on, and not the entire utf8 category itself, warnings that should have been raised were not. And one more change: given malformed input with warnings turned off, this function used to return whatever it had computed so far, which is incomplete or erroneous garbage. It now returns the REPLACEMENT CHARACTER instead.

  Thanks to Hugo van der Sanden for reviewing and finding problems with an earlier version of these commits.
* utf8.h: Use correct definition of start byte (Karl Williamson, 2012-04-26; 1 file, -3/+1)
  The previous definition allowed for (illegal) overlongs. The uses of this macro in the core assume that it is accurate; the inaccuracy can cause such code to fail.
* utf8.h: Use correct UTF-8 downgradeable definition (Christian Hansen, 2012-04-26; 1 file, -1/+3)
  Previously, the macro changed by this commit would accept overlong sequences. This patch was changed by the committer to include EBCDIC changes and, in the non-EBCDIC case, to save a test by using a mask instead, in keeping with the prior version of the code.
* utf8.h: Restore macro defn incorrectly trashed earlier (Karl Williamson, 2012-01-29; 1 file, -1/+3)
  Commit 66cbab2c91fca8c9abc65a7231a053898208efe3 changed the definition of IN_UNI_8_BIT, but in so doing lost the 2nd line of the macro, and I did not catch it. Tests will be added shortly.
* Add :not_characters parameter to 'use locale' (Karl Williamson, 2012-01-21; 1 file, -2/+2)
  This adds the parameter handling, tests, and documentation for this new feature, which allows locale and Unicode to play well with each other.
* Bump several file copyright dates (Steffen Schwigon, 2012-01-19; 1 file, -1/+2)
  Sync copyright dates with actual changes according to git history. [Plus run regen_perly.h to update the SHA-256 checksums, and regen/regcharclass.pl to update regcharclass.h]
* utf8.c: Allow changed behavior of utf8 under locale (Karl Williamson, 2011-12-15; 1 file, -1/+9)
  This changes the 4 case changing functions to take extra parameters specifying whether the utf8 string is to be processed under locale rules when the code points are < 256. The current functions are changed to macros that call the new versions, so that current behavior is unchanged. An additional, static, function is created that makes sure the 255/256 boundary is not crossed during the case change.
* utf8.h: Add missing parens (Karl Williamson, 2011-11-21; 1 file, -1/+1)
  These weren't caught because this code is only compiled on an EBCDIC platform, and I had to fake it to force the compilation.
* utf8.h: define IS_UTF8_CHAR for EBCDIC (Karl Williamson, 2011-11-21; 1 file, -3/+53)
  This is based on my eyeballing a file I had generated of the encodings for Unicode code points, so it could be wrong. It does compile.
* utf8.h: White space only (Karl Williamson, 2011-11-21; 1 file, -16/+16)
  This indents for clarity with the surrounding #if and #endif.
* utf8.h: clarify comment (Karl Williamson, 2011-11-10; 1 file, -1/+2)
* utf8.c: Add 'input pre-folded' flags to foldEQ_utf8_flags (Karl Williamson, 2011-10-17; 1 file, -0/+2)
  This adds flags so that if one of the input strings is known to have already been folded, this routine can skip the (redundant) folding step.
* utf8.h: Revise formal parameter name for clarity (Karl Williamson, 2011-10-01; 1 file, -7/+7)
* utf8.h: Remove redundant checks (Karl Williamson, 2011-10-01; 1 file, -6/+7)
  The macros that these call have been revised to do the same checks, enhanced to not call the functions for all of Latin1, not just ASCII as these did. So the tests here are redundant.
* utf8.c: Add _flags version of to_utf8_fold() (Karl Williamson, 2011-05-03; 1 file, -0/+3)
  And also of to_uni_fold(). The flag allows retrieving either simple or full folds. The interface is subject to change, so these are marked experimental and their names begin with an underscore. The old versions are turned into macros calling the new versions with the correct extra parameter.
* utf8.h: Add #define (Karl Williamson, 2011-03-20; 1 file, -0/+1)
* utf8.h: A fold buffer needs to hold any utf8 char (Karl Williamson, 2011-03-19; 1 file, -2/+3)
  It can't just be large enough to hold the Unicode subset.
* Add #defines for 2 Latin1 chars (Karl Williamson, 2011-02-27; 1 file, -0/+2)
  These will be used in a future commit; the ordinals are different on EBCDIC vs. ASCII.
* Move some #defines (Karl Williamson, 2011-02-27; 1 file, -0/+2)
  These were defined in a .c file, but now another .c file needs them, so move them to a header.
* Free up bit in ANYOF flags (Karl Williamson, 2011-02-25; 1 file, -1/+1)
  This is the foundation for fixing the regression RT #82610. My analysis was wrong that two bits could be shared, at least not without further work. This changes things to use a different mechanism to pass needed information to regexec.c, so that another bit can be freed up and, in a later commit, the two bits can become unshared again.

  The bit that is freed up is ANYOF_UTF8, which basically said there is something matched outside the ANYOF bitmap that requires the target string to be in utf8. Now the existence of something besides the bitmap indicates this, so no flag is needed. The flag bit ANYOF_NONBITMAP_NON_UTF8 remains, to indicate that there is something that should be matched outside the bitmap even if the target string isn't in utf8.
* foldEQ_utf8(): Add locale handling (Karl Williamson, 2011-02-19; 1 file, -0/+1)
  A new flag can now be passed to this function to indicate the use of locale rules for code points below 256. Unicode rules are still used above 255. Folds which cross that boundary are disallowed.