path: root/utf8.c
* Refactor \X regex handling to avoid a typical case table lookup (Karl Williamson, 2012-08-28; 1 file, -3/+3)
Prior to this commit, 98.4% of Unicode code points that went through \X had to be looked up to see if they begin a grapheme cluster, then looked up again to find that they didn't require special handling. This commit refactors things so only one look-up is required for those 98.4%. It changes the table generated by mktables to accomplish this, and hence the table's name, and references to it, are changed to correspond.
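A minimal sketch of the idea, with illustrative names rather than the actual mktables output: encode "begins a cluster" and "needs special handling" in a single table value, so one probe answers both questions.

    #include <stdbool.h>

    /* Hypothetical three-way classification; the real table is generated
       by mktables and covers all of Unicode. */
    typedef enum { GCB_ORDINARY, GCB_EXTEND, GCB_SPECIAL } gcb_class;

    /* Stub standing in for the generated table lookup. */
    static gcb_class gcb_lookup(unsigned long cp)
    {
        if (cp == 0x0300) return GCB_EXTEND;   /* COMBINING GRAVE ACCENT */
        if (cp == 0x1100) return GCB_SPECIAL;  /* HANGUL CHOSEONG KIYEOK */
        return GCB_ORDINARY;                   /* the ~98.4% typical case */
    }

    static bool x_fast_path(unsigned long cp)
    {
        /* One lookup decides both "begins a cluster" and "needs no
           special handling", where the old code probed twice. */
        return gcb_lookup(cp) == GCB_ORDINARY;
    }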
* Prepare for Unicode 6.2 (Karl Williamson, 2012-08-26; 1 file, -3/+13)
This changes code to be able to handle Unicode 6.2, while continuing to handle all previous releases. The major change was a new definition of \X, which adds a property to its calculation. Unfortunately, \X is hard-coded into regexec.c, and so has to be revised whenever there is a change of this magnitude in Unicode, which fortunately isn't all that often. I refactored the code in mktables to make it easier next time there is a change like this one.
* regex: Speed up \X processing (Karl Williamson, 2012-08-25; 1 file, -1/+31)
For most Unicode releases, GCB=Prepend matches absolutely nothing. And that appears to be the case going forward, as they added things to it and later removed them based on field experience. An earlier commit improved the performance of this significantly by using a binary search of an empty array instead of a swash hash. However, that search requires several layers of function calls to discover that it is empty, which this commit avoids. This patch will use whatever swash_init() returns unless it is empty, preserving backwards compatibility with older Unicode releases. But if it is empty, the routine sets things up so that future calls will always fail without further testing.
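A hypothetical sketch of the fast-fail caching this describes (the names and the tri-state flag are illustrative, not the actual swash code):

    #include <stdbool.h>
    #include <stddef.h>

    static const unsigned long *prepend_list = NULL;  /* sorted range starts */
    static size_t prepend_len   = 0;
    static int    prepend_state = 0;   /* 0 = not loaded, 1 = usable, -1 = empty */

    static size_t load_prepend(const unsigned long **listp)
    {
        *listp = NULL;   /* stub: GCB=Prepend is empty in most Unicode versions */
        return 0;
    }

    static bool is_prepend(unsigned long cp)
    {
        if (prepend_state == -1)
            return false;              /* cached emptiness: no layers of calls */
        if (prepend_state == 0) {
            prepend_len   = load_prepend(&prepend_list);
            prepend_state = prepend_len ? 1 : -1;
            if (prepend_state == -1)
                return false;          /* all future calls hit the line above */
        }
        /* Nonempty list: search pairs of [start, one-past-end) ranges. */
        for (size_t i = 0; i + 1 < prepend_len; i += 2)
            if (cp >= prepend_list[i] && cp < prepend_list[i + 1])
                return true;
        return false;
    }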
* utf8.c: indent in new block: White space-only (Karl Williamson, 2012-08-25; 1 file, -2/+2)
* utf8.c: Prefer binsearch over swash hash for small swashes (Karl Williamson, 2012-08-25; 1 file, -7/+46)
A binary swash is a hash of bitmaps used to cache the results of looking up whether a code point matches a Unicode property or regex bracketed character class. An inversion list is a data structure that also holds information about which code points match a Unicode property or character class. It is implemented as an SV* pointing to a sorted C array, and hence can be searched using a binary search.

This patch converts to using a binary search of an inversion list instead of a hash look-up for inversion lists that are no more than 512 elements (9 iterations of the search loop). That number can easily be adjusted if necessary.

Theoretically, a hash is faster than a binary search over a very long run, so this may negatively impact long-running servers. But in the short run, where most programs reside, the binary search is significantly faster. A swash is filled as necessary as time goes on, caching each new distinct code point it is called with. If it is called with many, many such code points, its performance can degrade as collisions increase. A binary search does not have that drawback. However, most real-world scenarios do not have a program being called with huge numbers of distinct code points. Mostly, a program will be called with code points from just one or a few of the world's scripts, so the swash will remain sparse.

The bitmaps in a swash are each 64 bits long (except for ASCII, where they are 128). That means that when the swash is populated, a lookup of a single code point that hasn't been checked before also has to look up the 63 adjoining code points, increasing its startup overhead. Of course, if one of those 63 code points is later accessed, no extra populating happens. This is the typical case, where a language's code points are all near each other.

The bottom line, though, is that in the short term this patch speeds up the processing of \X regex matching by about 35-40%, with modern Korean (which has uniquely complicated \X processing) closer to 40%, and other scripts closer to 35%.

The 512 boundary means that over 90% of the official Unicode properties are handled using binary search. I settled on that number by experimenting with several properties besides \X and with various powers-of-2 limits. Until I got that high, performance kept improving when the property went from being a swash to a binary search. \X improved even up to 2048, which encompasses 100% of the official Unicode properties.

The implementation changes so that an inversion list instead of a swash is returned by swash_init() when the input flags allow it to do so, for all inversion lists shorter than the compiled-in constant of 512 (actually <= 512). The other functions that access swashes have added intelligence to deal with an object of either type. Should someone on CPAN be using the public swash_init() interface, they will not see any difference, as the option to get an inversion list is not available to them.
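For readers unfamiliar with inversion lists, an illustrative, self-contained sketch of the membership test (not Perl's code; Perl wraps its lists in SVs): the array holds sorted range boundaries, with even indices starting in-set ranges and odd indices starting out-of-set ranges.

    #include <stdbool.h>
    #include <stddef.h>

    static bool invlist_contains(const unsigned long *list, size_t len,
                                 unsigned long cp)
    {
        size_t lo = 0, hi = len;        /* count the elements <= cp */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (list[mid] <= cp)
                lo = mid + 1;
            else
                hi = mid;
        }
        /* cp lies in the range begun by element lo-1, which is in the
           set exactly when that index is even, i.e. when lo is odd. */
        return lo & 1;
    }

    /* Example: ASCII letters as a 4-element inversion list.  A list of
       512 elements needs at most 9 probes of the loop above. */
    static const unsigned long ascii_letters[] = { 0x41, 0x5B, 0x61, 0x7B };
    /* invlist_contains(ascii_letters, 4, 'A')  => true
       invlist_contains(ascii_letters, 4, '0')  => false */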
* utf8.c: Bypass a subroutine wrapper (Karl Williamson, 2012-08-25; 1 file, -1/+1)
We might as well call the core swash initialization, since we are the core here and the public one merely wraps it.
* utf8.c: Add comment about speed-up attempt (Karl Williamson, 2012-08-25; 1 file, -1/+2)
This might keep someone from later attempting the speedup, which didn't actually help, so I didn't commit it.
* utf8.c: Shorten hash key for speed (Karl Williamson, 2012-08-25; 1 file, -4/+4)
Experiments have shown that longer hash keys impact performance. See the thread at http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2012-08/msg00869.html. This patch shortens a key used very frequently. There are other keys in this hash which are used frequently in some circumstances, but I expect to change the code to use fewer of them in the future, so am not changing them now.
* utf8.c: collapse a function parameter (Karl Williamson, 2012-08-25; 1 file, -5/+5)
Now that we have a flags parameter, we can pass this parameter as just another flag, giving a cleaner interface to this internal-only function. This also renames the flag parameter to <flag_p> to indicate it needs to be dereferenced.
* regexec.c: Use get method instead of internals (Karl Williamson, 2012-08-25; 1 file, -1/+7)
A new get method has been written to access the internals of a swash; it's best to use it. This also moves the error checking into the method.
* embed.fnc: Turn null wrapper function into macro (Karl Williamson, 2012-08-25; 1 file, -1/+3)
This function only does something on EBCDIC platforms. On ASCII ones, make it a macro, like similar ones, to avoid useless function nesting.
* utf8.c: Revise internal API of swash_init() (Karl Williamson, 2012-08-25; 1 file, -28/+31)
This revises the API for the version of swash_init() that is usable by core Perl. The external interface is unaffected. There is now a flags parameter to allow for future growth. And the core internal-only function that returns whether a swash has a user-defined property in it has been removed. This information is now returned via the new flags parameter upon initialization, and is unavailable afterwards. This is to prepare for the flexibility to change the swash that is needed in future commits.
* Comment out unused function (Karl Williamson, 2012-08-25; 1 file, -0/+2)
In looking at \X handling, I noticed that this function, which is intended for use in it, actually isn't used. This function may someday be useful, so I'm leaving the source in.
* utf8.c: Speed up \X processing of Korean (Karl Williamson, 2012-08-25; 1 file, -2/+48)
\X matches according to a complicated pattern that is hard-coded in regexec.c. Part of that pattern involves checking whether a code point is a component of a Hangul syllable or not. For Korean code points, this involves checking against multiple tables. It turns out that two of those tables are arranged so that the checks for them can be done via an arithmetic expression; Unicode publishes algorithms for determining various characteristics based on the characters' very structured ordering. This patch converts the routines that check these two tables to instead use the arithmetic expression.
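The structured ordering referred to is the precomposed Hangul syllable block defined in chapter 3 of the Unicode standard. A self-contained sketch of the arithmetic (the constants come from the standard; that these are exactly the two tables the commit converted is an assumption):

    #include <stdbool.h>

    #define SBASE  0xAC00u   /* first precomposed Hangul syllable */
    #define SCOUNT 11172u    /* number of precomposed syllables */
    #define TCOUNT 28u       /* trailing-consonant count, including "none" */

    static bool is_hangul_syllable(unsigned long cp)
    { return cp >= SBASE && cp < SBASE + SCOUNT; }

    /* A syllable whose index is divisible by TCOUNT has no trailing
       consonant, so it is GCB=LV; every other precomposed syllable is
       GCB=LVT.  No table search is needed. */
    static bool is_gcb_lv(unsigned long cp)
    { return is_hangul_syllable(cp) && (cp - SBASE) % TCOUNT == 0; }

    static bool is_gcb_lvt(unsigned long cp)
    { return is_hangul_syllable(cp) && (cp - SBASE) % TCOUNT != 0; }

    /* Example: U+AC00 (GA) is LV; U+AC01 (GAG) is LVT. */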
* Add empty inline_invlist.c (Karl Williamson, 2012-08-25; 1 file, -0/+1)
This will be used for things that need to handle inversion lists in the three files that currently use them. I'm putting this in a separate hdr because inversion lists are very internal-only, so they should not be grouped in with things for which there is an external API. It is a dot-c file so that the functions can continue to be declared with embed.fnc, and porting/args_assert.t will continue to work, as it looks only in .c files.
* Omnibus removal of register declarations (Karl Williamson, 2012-08-18; 1 file, -8/+8)
This removes most register declarations in C code (and accompanying documentation) in the Perl core. Retained are those in the ext directory, Configure, and those that are associated with assembly language.

See http://stackoverflow.com/questions/314994/whats-a-good-example-of-register-variable-usage-in-c which says, in part:

There is no good example of register usage when using modern compilers (read: last 10+ years) because it almost never does any good and can do some bad. When you use register, you are telling the compiler "I know how to optimize my code better than you do", which is almost never the case. One of three things can happen when you use register:

- The compiler ignores it. This is most likely; in that case the only harm is that you cannot take the address of the variable in the code.
- The compiler honors your request and as a result the code runs slower.
- The compiler honors your request and the code runs faster; this is the least likely scenario.

Even if one compiler produces better code when you use register, there is no reason to believe another will do the same. If you have some critical code that the compiler is not optimizing well enough, your best bet is probably to use assembler for that part anyway, but of course do the appropriate profiling to verify that the generated code is really a problem first.
* utf8.c: Add a get_() method to hide internal details (Karl Williamson, 2012-07-24; 1 file, -0/+14)
This should have been written this way to begin with (I'm the culprit). But we should have a method so another routine doesn't have to know the internal details.
* utf8.c: Add info to commented-out -DU lines (Karl Williamson, 2012-07-24; 1 file, -3/+3)
This proved useful when I recently needed to use these for debugging.
* Only generate above-Uni warning for \p{}, \P{} (Karl Williamson, 2012-07-19; 1 file, -18/+0)
This warning was being generated inappropriately during some internal operations, such as parsing a program; spotted by Tom Christiansen. The solution is to move the check for this situation out of the common code and into the code where just \p{} and \P{} are handled. As mentioned in the commit's perldelta, there remains a bug [perl #114148] where no warning gets generated when it should.
* utf8.c: Create API so internals can be hidden (Karl Williamson, 2012-07-19; 1 file, -0/+13)
This creates a function to hide some of the internal details of swashes from the regex engine, which is the only authorized user, enforced through #ifdefs in embed.fnc. These work closely together, but it's best to have a clean interface.
* handy.h: Fix isBLANK_uni and isBLANK_utf8 (Karl Williamson, 2012-06-29; 1 file, -0/+24)
These macros have never worked outside the Latin1 range, so this extends them to work there. There are no tests I could find for things in handy.h, except that many of them are called all over the place during the normal course of events. This commit adds a new file for such testing, containing for now only a few tests for the isBLANK's.
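"Blank" here means horizontal whitespace. A self-contained sketch of what an above-Latin1 blank test has to cover (a representative subset of the Unicode space characters; illustrative, not the actual handy.h macros):

    #include <stdbool.h>

    static bool is_blank_above_latin1(unsigned long cp)
    {
        return cp == 0x1680                    /* OGHAM SPACE MARK */
            || (cp >= 0x2000 && cp <= 0x200A)  /* EN QUAD .. HAIR SPACE */
            || cp == 0x202F                    /* NARROW NO-BREAK SPACE */
            || cp == 0x205F                    /* MEDIUM MATHEMATICAL SPACE */
            || cp == 0x3000;                   /* IDEOGRAPHIC SPACE */
    }

    /* Within Latin1 the blanks are TAB (0x09), SPACE (0x20), and
       NO-BREAK SPACE (0xA0); the fix extends the macros beyond that. */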
* Use assertions for /* NOT REACHED */ (Father Chrysostomos, 2012-06-15; 1 file, -1/+1)
...to make sure it really is never reached.
* utf8.c: White-space only (Karl Williamson, 2012-06-07; 1 file, -2/+2)
* utf8.c: Refactor a portion of to_utf8_case() (Karl Williamson, 2012-06-07; 1 file, -4/+11)
This routine can never return 0, since if there is no case mapping, the input is used instead. The code point for that input has already been derived earlier in the function, so it doesn't have to be recalculated. And, rearrange the order of things slightly.
* utf8.c: Avoid some extra work (Karl Williamson, 2012-06-07; 1 file, -3/+5)
In the case changed here, the output is the input, so we can just Copy it instead of re-deriving it.
* utf8.c: Add, revise comments (Karl Williamson, 2012-06-07; 1 file, -2/+2)
* utf8.c: Use new internal properties for \X (Karl Williamson, 2012-06-02; 1 file, -7/+7)
These new properties are generated for all Unicode releases, and so \X can now work under all Unicode versions, not just the ones where Unicode has defined the properties.
* update the editor hints for spaces, not tabs (Ricardo Signes, 2012-05-29; 1 file, -2/+2)
This updates the editor hints in our files for Emacs and vim to request that tabs be inserted as spaces.
* utf8.c: Add nomix-ASCII option to to_fold functions (Karl Williamson, 2012-05-22; 1 file, -11/+72)
Under /iaa regex matching, folds that cross the ASCII/non-ASCII boundary are prohibited. This changes the _to_uni_fold_flags() and _to_utf8_fold_flags() functions to take a new flag which, when set, tells them not to accept such folds. This allows us to later move the intelligence for handling this situation into these centralized functions.
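A self-contained sketch of the boundary check involved (illustrative; the actual work happens inside the fold functions under the new flag): a fold crosses the boundary when exactly one side of the mapping is ASCII.

    #include <stdbool.h>

    static bool fold_mixes_ascii(unsigned long from, unsigned long to)
    {
        return (from < 0x80) != (to < 0x80);   /* exactly one side is ASCII */
    }

    /* Folds rejected under /iaa include:
       U+017F LATIN SMALL LETTER LONG S -> 's' (U+0073)
       U+212A KELVIN SIGN               -> 'k' (U+006B) */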
* utf8.c: Add assertion (Karl Williamson, 2012-05-22; 1 file, -0/+2)
* utf8.c: Re-order if branches for speed (Karl Williamson, 2012-05-22; 1 file, -11/+11)
Probably the C optimizer does this anyway, but do the uncomplicated test before the (mutually exclusive) complicated test (though the complications are hidden in a macro). The new first test is a prerequisite for the new second test anyway.
* utf8.c: Add comment (Karl Williamson, 2012-05-22; 1 file, -0/+4)
* utf8n_to_uvuni(): Add a few compiler hints (Karl Williamson, 2012-05-22; 1 file, -6/+6)
Tell the compiler that malformed input is not likely, so it can optimize accordingly.
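A minimal sketch of what such hints look like on GCC-compatible compilers (the macro names and the simplified decode step are illustrative, not utf8n_to_uvuni() itself):

    /* Branch-prediction hints: tell the compiler which way a test
       usually goes so it can lay out the hot path accordingly. */
    #define MY_LIKELY(x)   __builtin_expect(!!(x), 1)
    #define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)

    static long decode_step(unsigned long uv, unsigned char b, int *err)
    {
        if (MY_UNLIKELY((b & 0xC0) != 0x80)) {  /* not 10xxxxxx */
            *err = 1;                /* malformed input: the cold path */
            return -1;
        }
        return (long)((uv << 6) | (b & 0x3F));  /* accumulate 6 bits */
    }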
* utf8.c: Skip extraneous function call (Karl Williamson, 2012-05-22; 1 file, -1/+1)
This eliminates an intermediate function call by calling the base-level one directly.
* utf8.c: Remove unnecessary validation (Karl Williamson, 2012-05-22; 1 file, -8/+31)
These two functions are to be called only on strings known to be valid, so we can skip the validation.
* utf8.c: Extra branch to avoid others in the typical case (Karl Williamson, 2012-05-22; 1 file, -1/+4)
This test eliminates all code points less than U+D800 from having to be checked more than once, at the expense of an extra test for code points that are larger.
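A self-contained sketch of the shape of this optimization (illustrative; the real code also deals with warning flags): a single comparison settles everything below U+D800 before the detailed checks run.

    #include <stdbool.h>

    static bool is_problematic_cp(unsigned long uv)
    {
        if (uv < 0xD800)
            return false;                /* typical case: one test only */

        /* Larger code points pay for the extra test above, but are rare. */
        if (uv <= 0xDFFF)
            return true;                 /* UTF-16 surrogate */
        if (uv > 0x10FFFF)
            return true;                 /* above Unicode */
        if ((uv & 0xFFFE) == 0xFFFE)
            return true;                 /* U+xxFFFE / U+xxFFFF noncharacters */
        if (uv >= 0xFDD0 && uv <= 0xFDEF)
            return true;                 /* the noncharacter block */
        return false;
    }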
* utf8n_to_uvuni(): Fix broken malformation interactions (Karl Williamson, 2012-05-01; 1 file, -3/+12)
All code points whose UTF-8 representations start with a byte containing either \xFE or \xFF are considered problematic because they are not portable. There are many such code points that are too large to represent on a 32- or even a 64-bit platform. Commit eb83ed87110e41de6a4cd4463f75df60798a9243 failed to properly catch overflow when the input flags to this function say to warn on, but otherwise accept, FE and FF sequences. Now overflow is checked for unconditionally.
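A minimal sketch of an unconditional overflow check during accumulation (illustrative, not the commit's exact code): test whether the next 6-bit shift would lose high bits before performing it, regardless of any warning flags.

    #include <limits.h>

    /* Each continuation byte contributes 6 bits: uv = (uv << 6) | bits.
       Returns 0 if accepting this byte would overflow the accumulator. */
    static int add_continuation(unsigned long *uv, unsigned char b)
    {
        if (*uv > (ULONG_MAX >> 6))
            return 0;                    /* shift would lose bits: overflow */
        *uv = (*uv << 6) | (b & 0x3Fu);
        return 1;
    }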
* is_utf8_char_slow(): Avoid accepting overlongs (Karl Williamson, 2012-04-26; 1 file, -33/+5)
There are possible overlong sequences that this function blindly accepts. Instead of developing the code to figure this out, turn this function into a wrapper for utf8n_to_uvuni(), which already has this check.
* perlapi: Update for changes in utf8 decoding (Karl Williamson, 2012-04-26; 1 file, -14/+36)
* utf8.c: White-space only (Karl Williamson, 2012-04-26; 1 file, -5/+5)
This outdents to account for the removal of a surrounding block.
* utf8.c: refactor utf8n_to_uvuni() (Karl Williamson, 2012-04-26; 1 file, -141/+255)
The prior version had a number of issues, some of which have been taken care of in previous commits.

The goal when presented with malformed input is to consume as few bytes as possible, so as to position the input for the next try at the first possible byte that could be the beginning of a character. We don't want to consume too few bytes, so that the next call has us thinking that what is the middle of a character is really the beginning; nor do we want to consume too many, so as to skip valid input characters. (The latter is forbidden by the Unicode standard because of security considerations.) The previous code could do both of these under various circumstances.

In some cases it took as a given that the first byte in a character is correct, and skipped looking at the rest of the bytes in the sequence. This is wrong when just that first byte is garbled. We have to look at all bytes in the expected sequence to make sure it hasn't been prematurely terminated, contrary to what that first byte led us to expect. Likewise when we get an overflow: we have to keep looking at each byte in the sequence. It may be that the initial byte was garbled, so that it appeared that there was going to be overflow, but in reality the input was supposed to be a shorter sequence that doesn't overflow. We want to have an error on that shorter sequence, and advance the pointer to just beyond it, which is the first position where a valid character could start. This fixes a long-standing TODO from an externally supplied utf8 decode test suite.

And the old algorithm for finding overflow failed to detect it on some inputs. This was spotted by Hugo van der Sanden, who suggested the new algorithm that this commit uses, and which should work in all instances. For example, on a 32-bit machine, any string beginning with "\xFE" and having the next byte be either "\x86" or "\x87" overflows, but this was missed by the old algorithm.

Another bug was that the code was careless about what happens when a malformation occurs that the input flags allow. For example, a sequence should not start with a continuation byte. If that malformation is allowed, the code pretended the byte is a start byte and extracted the "length" of the sequence from it. But pretending it is a start byte is not the same thing as it actually being a start byte, so there is no extractable length in it, and the number that this code thought was "length" was bogus.

Yet another bug fixed is that if only the warning subcategories of the utf8 category were turned on, and not the entire utf8 category itself, warnings were not raised that should have been. And yet another change is that, given malformed input with warnings turned off, this function used to return whatever it had computed so far, which is incomplete or erroneous garbage. This commit changes it to return the REPLACEMENT CHARACTER instead.

Thanks to Hugo van der Sanden for reviewing and finding problems with an earlier version of these commits.
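A self-contained sketch of the "consume only the malformed part" rule described above (illustrative; the real function also classifies the malformation and consults flags): after an error, skip the first byte plus the run of continuation bytes actually present, so the next attempt starts at the first byte that could begin a character.

    #include <stddef.h>

    static size_t malformed_skip(const unsigned char *s,
                                 const unsigned char *end)
    {
        const unsigned char *p = s + 1;
        while (p < end && (*p & 0xC0) == 0x80)   /* 10xxxxxx continuations */
            p++;
        return (size_t)(p - s);   /* never too few bytes, never too many */
    }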
* utf8n_to_uvuni: Avoid reading outside of buffer (Karl Williamson, 2012-04-26; 1 file, -4/+4)
Prior to this patch, if the first byte of a UTF-8 sequence indicated that the sequence occupied n bytes, but the input parameters indicated that fewer were available, all n were attempted to be read.
* utf8.c: Clarify and correct pod (Karl Williamson, 2012-04-26; 1 file, -6/+6)
Some of these problems were spotted by Hugo van der Sanden.
* utf8.c: Use macros instead of if..else.. sequence (Karl Williamson, 2012-04-26; 1 file, -12/+2)
There are two existing macros that do the job that this longish sequence does. One, UTF8SKIP(), does an array lookup and is very likely to be in the machine's cache, as it is used ubiquitously when processing UTF-8. The other is a simple test and shift. These macros simplify the code and should speed things up as well.
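UTF8SKIP() is a 256-entry table indexed by a sequence's first byte. A sketch of such a table for ASCII platforms (illustrative: Perl's real table treats 0xFE and 0xFF specially, and the range designators `[a ... b]` are a GCC/Clang extension used here for brevity):

    static const unsigned char utf8_skip[256] = {
        [0x00 ... 0x7F] = 1,   /* 0xxxxxxx: ASCII                        */
        [0x80 ... 0xBF] = 1,   /* continuations: invalid as start bytes;
                                  length 1 keeps a scan advancing        */
        [0xC0 ... 0xDF] = 2,   /* 110xxxxx                               */
        [0xE0 ... 0xEF] = 3,   /* 1110xxxx                               */
        [0xF0 ... 0xF7] = 4,   /* 11110xxx                               */
        [0xF8 ... 0xFB] = 5,   /* 111110xx                               */
        [0xFC ... 0xFF] = 6,   /* 1111110x; FE/FF differ in Perl's table */
    };

    /* One array read replaces the whole if..else.. chain. */
    #define MY_UTF8SKIP(s) utf8_skip[*(const unsigned char *)(s)]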
* PATCH: [perl #112530] Panic with inversion lists (Karl Williamson, 2012-04-23; 1 file, -1/+1)
The code assumed that all property definitions would be well-formed, meaning, in part, that they would be numerically sorted by code point, with each range disjoint from all others. So the code was just appending each range as it was found to the inversion list it was building. This assumption is true for all definitions generated by mktables, but it might not be true for user-defined ones. The solution is merely to change from calling the function that appends to instead calling the existing function that handles the more general case.

However, that function was not previously used outside the file it was defined in, so it must now be made public. Also, this whole interface is considered volatile, so the names of the public functions in it begin with an underscore to further discourage XS writers from using them. Therefore the more general add function is renamed to begin with an underscore. And the append function is no longer needed outside the file it is defined in, so, again to keep XS writers from using it, this commit makes it static.
* PATCH: [perl #111338] Warnings in utf8 subcategories do nothing in isolation (Karl Williamson, 2012-04-17; 1 file, -1/+1)
This was the result of assuming that these subcategories would not be on unless the main category was also on.
* utf8.c: Add back inadvertently deleted pod text (Karl Williamson, 2012-03-30; 1 file, -0/+3)
This was deleted by mistake in commit 4b88fb76efce8c436e63b907c9842345d4fa77c7.
* Remove more uses of utf8_to_uvchr() (Karl Williamson, 2012-03-30; 1 file, -2/+2)
Commit 4b88fb76efce8c436e63b907c9842345d4fa77c7 missed 2 occurrences of this, one of which is #ifdef'd out.
* Deprecate utf8_to_uvchr() and utf8_to_uvuni() (Karl Williamson, 2012-03-19; 1 file, -4/+8)
These functions can read beyond the end of their input strings if presented with malformed UTF-8 input. Perl core code has been converted to use other functions instead of these.
* Use the new utf8 to code point functions (Karl Williamson, 2012-03-19; 1 file, -21/+25)
These functions should be used in preference to the old ones, which can read beyond the end of the input string.
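The new functions take an explicit end-of-buffer pointer, so decoding can never scan past the string. A hedged sketch of the calling pattern (utf8_to_uvchr_buf() is the documented replacement; error handling and thread context are elided):

    #include "EXTERN.h"
    #include "perl.h"

    /* XS-style sketch: decode the first character of an SV's string
       without risking a read past its end. */
    static UV first_char(SV *sv)
    {
        STRLEN len;
        const U8 *s    = (const U8 *)SvPV(sv, len);
        const U8 *send = s + len;          /* one past the last byte */
        STRLEN retlen;

        /* Deprecated: utf8_to_uvchr(s, &retlen) cannot know where the
           buffer ends.  Preferred: pass the end pointer explicitly. */
        return utf8_to_uvchr_buf(s, send, &retlen);
    }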