path: root/utf8.c
Commit message | Author | Age | Files | Lines
* utf8.c: typos in pod | Karl Williamson | 2011-11-21 | 1 | -2/+2
|
* PATCH: [perl #32080] is_utf8_string() reads too far | Karl Williamson | 2011-11-21 | 1 | -28/+30
| This function and is_utf8_string_loclen() are modified to check before reading beyond the end of the string; and the pod for is_utf8_char() is modified to warn about the buffer overflow potential.
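The "check before reading" idea can be sketched as follows. This is a minimal illustration, not Perl's implementation: the names `seq_len` and `sketch_is_utf8_string` are hypothetical, and the sketch assumes standard UTF-8 of 1 to 4 bytes only (Perl accepts an extended range) and omits overlong and surrogate checks. The point it shows is that the sequence length implied by the lead byte is compared against the bytes remaining before any continuation byte is read.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a lead byte,
 * or 0 if the byte cannot start a sequence (standard UTF-8 only). */
static size_t seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Hypothetical validator: never reads at or beyond s + len. */
bool sketch_is_utf8_string(const unsigned char *s, size_t len)
{
    const unsigned char *p = s;
    const unsigned char *end = s + len;

    while (p < end) {
        size_t n = seq_len(*p);

        /* Reject before touching any byte past the end of the buffer. */
        if (n == 0 || n > (size_t)(end - p))
            return false;

        /* Every continuation byte must look like 10xxxxxx. */
        for (size_t i = 1; i < n; i++) {
            if ((p[i] & 0xC0) != 0x80)
                return false;
        }
        p += n;
    }
    return true;
}
```

A truncated sequence at the end of the buffer is reported as invalid rather than being completed from whatever memory follows the string.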
* utf8.c: typo in comment | Karl Williamson | 2011-11-12 | 1 | -1/+1
|
* utf8.c: Skip extra function calls | Karl Williamson | 2011-11-11 | 1 | -7/+3
| The function to_uni_fold() works without requiring conversion first to utf8.
* utf8.c: Add compiler hint | Karl Williamson | 2011-11-11 | 1 | -1/+1
| It's very rare that someone will be outputting these unusual code points.
* utf8.c: Add and revise comments | Karl Williamson | 2011-11-11 | 1 | -6/+34
| I now understand swashes enough to document them better; nits in other comments.
* utf8.c: Don't warn on \p{user-defined} for above-Unicode | Karl Williamson | 2011-11-10 | 1 | -13/+18
| Perl has allowed user-defined properties to match above-Unicode code points, while falsely warning that it doesn't. This removes that warning.
* utf8.c: Handle swashes at UV_MAX | Karl Williamson | 2011-11-10 | 1 | -0/+13
| The code assumed that there is a code point above the highest value we are looking at. That is true except when we are looking at the highest representable code point on the machine. A special case is needed for that.
* utf8.c: Fix swash handling under USE_MORE_BITS | Karl Williamson | 2011-11-10 | 1 | -1/+1
| On a 32-bit machine with USE_MORE_BITS, a UV is 64 bits, but STRLEN is 32 bits. A cast was missing during a bit complement, which led to the loss of the upper 32 bits.
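The class of bug described here can be reproduced in isolation. The sketch below is an assumption about the general shape of the problem, not the actual expression from utf8.c: `uv_t` and `strlen_t` are hypothetical stand-ins for a 64-bit UV and a 32-bit STRLEN, and the masking operation is invented for illustration. Complementing a 32-bit value yields a 32-bit mask, which is then zero-extended when combined with a 64-bit operand.

```c
#include <stdint.h>

/* Hypothetical stand-ins for Perl's UV and STRLEN under USE_MORE_BITS
 * on a 32-bit machine. */
typedef uint64_t uv_t;
typedef uint32_t strlen_t;

/* Buggy: ~(span - 1) is computed in 32 bits, so after widening the
 * mask has zeros in its upper 32 bits and clears that half of cp. */
uv_t round_down_buggy(uv_t cp, strlen_t span)
{
    return cp & ~(span - 1);
}

/* Fixed: widen first, then complement, so the mask keeps the upper
 * 32 bits set. span is assumed to be a power of two. */
uv_t round_down_fixed(uv_t cp, strlen_t span)
{
    return cp & ~((uv_t)span - 1);
}
```

For code points that fit in 32 bits the two functions agree, which is why a bug of this kind only shows up with very large values.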
* utf8.c: Make swashes work close to UV_MAX | Karl Williamson | 2011-11-09 | 1 | -1/+7
| When a code point is to be checked if it matches a property, a swatch of the swash is read in. Typically this is a block of 64 code points that contain the one desired. A bit map is set for those 64 code points, apparently under the expectation that the program will desire code points near the original. However, it just adds 63 to the original code point to get the ending point of the block. When the original is so close to the maximum UV expressible on the platform, this will overflow. The patch is simply to check for overflow and if it happens use the max possible.
| A special case is still needed to handle the very maximum possible code point, and a future commit will deal with that.
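The overflow check described above can be sketched like this. The function name `swatch_end` and the use of `uint64_t` in place of Perl's UV are assumptions made for illustration; only the "add 63, clamp on overflow" behavior comes from the commit message.

```c
#include <stdint.h>

#define SWATCH_SPAN 64  /* code points per swatch, per the commit message */

/* Hypothetical helper: last code point of the swatch starting at
 * `start`, clamped to the largest representable value instead of
 * wrapping around. */
uint64_t swatch_end(uint64_t start)
{
    /* start + 63 would wrap exactly when start > UINT64_MAX - 63. */
    if (start > UINT64_MAX - (SWATCH_SPAN - 1))
        return UINT64_MAX;
    return start + (SWATCH_SPAN - 1);
}
```

The comparison is done by subtracting from the maximum rather than adding to `start`, so the check itself cannot overflow.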
* utf8.c: Faster latin1 folding | Karl Williamson | 2011-11-08 | 1 | -1/+47
| This adds a function similar to the ones for the other three case changing operations that works on latin1 characters only, and avoids having to go out to swashes. It changes to_uni_fold() and to_utf8_fold() to call it on the appropriate input.
* utf8.c: Faster latin1 upper/title casing | Karl Williamson | 2011-11-08 | 1 | -2/+81
| This creates a new function to handle upper/title casing code points in the latin1 range, and avoids using a swash to compute the case. This is because the correct values are compiled-in. And it calls this function when appropriate for both title and upper casing, in both utf8 and uni forms.
| Unlike the similar function for lower casing, it may make sense for this function to be called from outside utf8.c, but inside the core, so it is not static, but its name begins with an underscore.
* utf8.c: Expand use of refactored to_uni_lower | Karl Williamson | 2011-11-08 | 1 | -1/+10
| The new function split out from to_uni_lower is now called when appropriate from to_utf8_lower. And to_uni_lower no longer calls to_utf8_lower, using the macro instead, saving a function call and duplicate work.
* utf8.c: Refactor to_uni_lower() | Karl Williamson | 2011-11-08 | 1 | -16/+27
| The portion that deals with Latin1 range characters is refactored into a separate (static) function, so that it can be called from more than one place.
* utf8.c: Refactor case-changing calls into macros | Karl Williamson | 2011-11-08 | 1 | -10/+20
| Future commits will use these in additional places, so macroize.
* utf8.c: Use proper Unicode property names | Karl Williamson | 2011-11-08 | 1 | -4/+4
| There are five functions in utf8.c that look up Unicode maps--the case changing functions. They look up these maps under the names ToDigit, ToFold, ToLower, ToTitle, and ToUpper. The imminent expansion of Unicode::UCD to return the mappings for all properties creates a naming conflict, as three of those names are the same as other properties, Upper, Lower, and Title. It was an unfortunate choice of names originally.
| Now mktables has been changed to create a list of mapping properties that utf8_heavy.pl reads. It uses the official names of those properties, so change utf8.c to correspond.
* utf8.c: Don't use swash for to_uni_lower() latin1 calls | Karl Williamson | 2011-10-17 | 1 | -2/+18
| The lowercase of latin-1 range code points is known to the perl core, so for those we can short-circuit converting to utf8 and reading in a swash.
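To illustrate why no table lookup from disk is needed in this range, here is a minimal sketch of Latin-1 lowercasing done with compiled-in knowledge. The function name `latin1_to_lower` is hypothetical and this is not the code in utf8.c, which uses its own macros; the sketch also assumes an ASCII (not EBCDIC) platform. In Latin-1, the uppercase letters are A to Z and U+00C0 to U+00DE, excluding the multiplication sign U+00D7, and each lowercase form sits 0x20 above its uppercase form.

```c
#include <stdint.h>

/* Hypothetical helper: lowercase a single Latin-1 code point using
 * only compiled-in knowledge of the range. */
uint8_t latin1_to_lower(uint8_t c)
{
    /* ASCII uppercase letters. */
    if (c >= 'A' && c <= 'Z')
        return (uint8_t)(c + 0x20);

    /* Upper Latin-1 letters; U+00D7 (multiplication sign) is caseless. */
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (uint8_t)(c + 0x20);

    /* Everything else, including U+00DF, is its own lowercase. */
    return c;
}
```

Lowercasing is the simple direction: every result stays inside the Latin-1 range, whereas uppercasing some Latin-1 characters produces code points outside it.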
* utf8.c: Add comment | Karl Williamson | 2011-10-17 | 1 | -36/+36
|
* utf8.c: White space only | Karl Williamson | 2011-10-17 | 1 | -34/+37
| Indent newly formed blocks, and reflow comments and code to fit in narrower space.
* utf8.c: Add 'input pre-folded' flags to foldEQ_utf8_flags | Karl Williamson | 2011-10-17 | 1 | -0/+24
| This adds flags so that if one of the input strings is known to already have been folded, this routine can skip the (redundant) folding step.
* utf8.c: Add comments | Karl Williamson | 2011-10-17 | 1 | -1/+12
|
* mro.c: Correct utf8 and bytes concatenation | Father Chrysostomos | 2011-10-06 | 1 | -0/+3
| The previous commit introduced some code that concatenates a pv on to an sv and then does SvUTF8_on on the sv if the pv was utf8.
| That can’t work if the sv was in Latin-1 (or single-byte) encoding and contained extra-ASCII characters. Nor can it work if bytes are appended to a utf8 sv. Both produce mangled utf8.
| There is apparently no function apart from sv_catsv that handles this. So I’ve modified sv_catpvn_flags to handle this if passed the SV_CATUTF8 (concatenating a utf8 pv) or SV_CATBYTES (concatenating a byte pv) flag.
| This avoids the overhead of creating a new sv (in fact, sv_catsv even copies its rhs in some cases, so that would mean creating two new svs). It might even be worthwhile to redefine sv_catsv in terms of this....
* utf8.c: Add function to retrieve new _Perl_IDStart prop | Karl Williamson | 2011-10-01 | 1 | -0/+10
|
* Comment-only nits | Karl Williamson | 2011-10-01 | 1 | -3/+4
|
* utf8.c: Remove (mostly) redundant test | Karl Williamson | 2011-10-01 | 1 | -4/+0
| The swashes already have the underscore, so this test is redundant. It does save some time for this character to avoid having to go out and load the swash, but why just the underscore? In fact an earlier commit changed the macro that most people should use to access this function to not even call it for the underscore.
* Don't use swash to find cntrls | Karl Williamson | 2011-10-01 | 1 | -4/+10
| Unicode stability policy guarantees that no code points will ever be added to the control characters beyond those already in it. All such characters are in the Latin1 range, and so the Perl core already knows which ones those are, and so there is no need to go out to disk and create a swash for these.
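Because the set of control characters is fixed, the test reduces to two range comparisons. The sketch below assumes "control characters" means Unicode General_Category Cc, which is U+0000 to U+001F and U+007F to U+009F; the function name `is_cntrl_cp` is hypothetical and is not the name used in the Perl core.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if cp is a control character (assumed
 * here to mean General_Category Cc), with no table lookup. */
bool is_cntrl_cp(uint64_t cp)
{
    /* C0 controls, then DEL plus the C1 controls. */
    return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
}
```

Any code point above U+009F can be rejected immediately, so the check costs the same for the whole code point range.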
* utf8.c: Use less confusing property name | Karl Williamson | 2011-10-01 | 1 | -1/+1
| The XPerlSpace is less confusing than SpacePerl (at least to me). It means take PerlSpace and extend it beyond ASCII.
* No need for swashes for properties that are ASCII-only | Karl Williamson | 2011-10-01 | 1 | -3/+9
| These three properties are restricted to being true only for ASCII characters. That information is compiled into Perl, so there is no need to create swashes for them.
* No need for swashes for computing if ASCII | Karl Williamson | 2011-10-01 | 1 | -4/+4
| This information is trivially computed via the macro; there is no need to go out to disk and store a swash for this.
* utf8.c: Call new function invlist_invert_prop() | Karl Williamson | 2011-10-01 | 1 | -1/+1
| This new function is now potentially called. However, there is no data file or other circumstances which currently cause this path to get executed.
* Revise diagnostic text | Karl Williamson | 2011-10-01 | 1 | -1/+1
| I believe that the new wording is clearer than the older, which I wrote.
* utf8.c: White space only | Karl Williamson | 2011-10-01 | 1 | -5/+5
| This indents a block of code to match being in a newly created block.
* utf8.c: Don't invert beyond-Unicode code points | Karl Williamson | 2011-10-01 | 1 | -0/+14
| The Unicode properties are defined only on Unicode code points. In the past, this meant all property matches would fail for non-Unicode code points. However, starting with 5.15.1 some properties do succeed. This restores the previous behavior.
* [perl #99984] Incorrect errmsg with our $::é | Father Chrysostomos | 2011-10-01 | 1 | -0/+2
| Having PL_parser->error_count set to non-zero when utf8_heavy.pl tries to do() one of its swashes results in ‘Compilation error’ being placed in $@ during the do, even if it was successful. This patch sets the error_count to 0 before calling SWASHNEW, to prevent that. It uses SAVEI8, to make sure it is restored on scope exit.
* Convert some files from Latin-1 to UTF-8 | Keith Thompson | 2011-09-07 | 1 | -2/+2
|
* utf8.c: Accept INVERT_IT in swash | Karl Williamson | 2011-07-03 | 1 | -1/+17
| This allows a swash to return a list, along with an extra key in the hash which says that the list should be inverted. A future commit will generate such keys.
* utf8.c: swash_to_invlist() handle EXTRAS | Karl Williamson | 2011-07-03 | 1 | -1/+78
| This function has not been able to handle what are called EXTRAS in its input. These are things like:
|     !utf8::InHiragana
|     -utf8::InKatakana
|     +utf8::IsCn
| besides the normal list of ranges.
| This commit allows this function to handle all the same constructs as the regular swash input function, from which most of the new code was copied.
* Add flag to num groks to silence non-portable warnings | Karl Williamson | 2011-07-03 | 1 | -4/+9
| Unicode inversion lists commonly will contain UV_MAX, which may trigger these warnings. Add a flag to suppress them to the numeric grok functions, which can be set by the code that is dealing with these lists.
* Change inversion lists to SVs | Karl Williamson | 2011-07-03 | 1 | -2/+2
| The inversion list is an opaque object, currently implemented as an SV. Even if it ends up being an HV in the future, Nicholas is of the opinion that it should be presented to the world as an SV*.
* utf8.c: revise comment | Karl Williamson | 2011-05-20 | 1 | -2/+3
|
* Fix some multi-char /i fold bugs | Karl Williamson | 2011-05-19 | 1 | -4/+147
| Consider U+FB05 and U+FB06. These both fold to 'st', and hence should match each other under /i. However, Unicode doesn't furnish a rule for this, and Perl hasn't been smart enough to figure it out. The bug that shows up is in constructs like "\x{fb06}" =~ /[^\x{fb05}]/i succeeding. Most of these instances also have a 'S' entry in Unicode's CaseFolding.txt, which avoids the problem (as mktables was earlier changed to include those in the generated table). But there were several code points that didn't.
| This patch changes utf8.c to look for these when constructing its inverted list of case fold equivalents. An alternative would have been to change mktables instead to look for them and create synthetic rules. But, this is more general in case the function ends up being used for other things.
| I will change fold_grind.t to test for these in a separate commit.
* utf8.c: Remove soon-to-be-obsoleted comment | Karl Williamson | 2011-05-19 | 1 | -16/+0
| This comment will no longer apply, as the code it talked about is moving into swash_init().
* utf8.c: Remove unnecessary temporary | Karl Williamson | 2011-05-19 | 1 | -5/+2
|
* utf8.c: "<" should be "<=" | Karl Williamson | 2011-05-19 | 1 | -1/+1
| av_len() is misnamed, and hence led me earlier to stop the loop one shy of what it should have been. No actual bugs were caused by this, but it could cause a duplicate entry in an array, which is searched linearly, hence a slight slowdown.
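The off-by-one comes from av_len() returning the highest index in the array rather than the element count. The sketch below mimics that convention with a hypothetical miniature array type; `mini_av`, `mini_av_len` and `mini_av_contains` are invented for illustration and are not part of the Perl API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature array; not Perl's AV. */
typedef struct {
    const unsigned long *items;
    ptrdiff_t count;
} mini_av;

/* Mirrors the av_len() convention: highest valid index, -1 if empty. */
ptrdiff_t mini_av_len(const mini_av *av)
{
    return av->count - 1;
}

/* Linear search that visits every element. */
bool mini_av_contains(const mini_av *av, unsigned long want)
{
    const ptrdiff_t top = mini_av_len(av);

    /* "<=" is required: with "<" the last element is never examined. */
    for (ptrdiff_t i = 0; i <= top; i++) {
        if (av->items[i] == want)
            return true;
    }
    return false;
}
```

With `<` in place of `<=`, a value stored only in the last slot would be reported as absent, and a caller that appends on "not found" would then add a duplicate, which matches the symptom described in the commit message.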
* utf8.c: Add _flags version of to_utf8_fold() | Karl Williamson | 2011-05-03 | 1 | -7/+12
| And also to_uni_fold().
| The flag allows retrieving either simple or full folds.
| The interface is subject to change, so these are marked experimental and their names begin with underscore. The old versions are turned into macros calling the new versions with the correct extra parameter.
* utf8.c: silence compiler warnings | David Mitchell | 2011-03-26 | 1 | -2/+2
| Prefer foo("%s", fixedstr) over foo(fixedstr). One day someone might change fixedstr to include '%' characters.
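The same rule applies to any printf-style function. The sketch below uses the standard `snprintf` as a stand-in for the unnamed `foo` in the commit message; the wrapper name `format_message` is hypothetical.

```c
#include <stdio.h>

/* Hypothetical wrapper: copies a message into out without ever
 * interpreting the message itself as a format string. */
int format_message(char *out, size_t outsz, const char *fixedstr)
{
    /* Risky alternative: snprintf(out, outsz, fixedstr);
     * any '%' in fixedstr would be parsed as a conversion and could
     * read arguments that were never passed. */
    return snprintf(out, outsz, "%s", fixedstr);
}
```

With the "%s" form, percent signs in the message are copied through unchanged, so later edits to the message text cannot turn it into an unintended format.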
* foldEQ_utf8(): Move rare tests out of main stream | Karl Williamson | 2011-02-22 | 1 | -14/+12
| The code for handling locale can be moved entirely to the place where locale handling is done for the second string, as by that time we have processed the first string, and the second. Since we only succeed if both are atomic, single-bytes, we don't need to do the loop below.
* utf8.c: Fix setting wrong variable | Karl Williamson | 2011-02-19 | 1 | -1/+1
| This doesn't appear to actually break anything.
* foldEQ_utf8(): Add locale handling | Karl Williamson | 2011-02-19 | 1 | -5/+66
| A new flag can now be passed to this function to request locale rules for code points below 256. Unicode rules are still used for code points above 255. Folds which cross that boundary are disallowed.
* Subclass utf8 warnings so can turn off individually | Karl Williamson | 2011-02-17 | 1 | -22/+31
|