path: root/utf8.c
Commit message | Author | Age | Files | Lines
* utf8.c: Accept INVERT_IT in swash | Karl Williamson | 2011-07-03 | 1 | -1/+17
| | | | | | | This allows a swash to return a list, along with an extra key in the hash which says that the list should be inverted. A future commit will generate such keys.
* utf8.c: swash_to_invlist() handle EXTRAS | Karl Williamson | 2011-07-03 | 1 | -1/+78
| | | | | | | | | | | | | | | This function has not been able to handle what are called EXTRAS in its input. These are things like: !utf8::InHiragana -utf8::InKatakana +utf8::IsCn besides the normal list of ranges. This commit allows this function to handle all the same constructs as the regular swash input function, from which most of the new code was copied.
* Add flag to num groks to silence non-portable warnings | Karl Williamson | 2011-07-03 | 1 | -4/+9
| | | | | | | Unicode inversion lists commonly will contain UV_MAX, which may trigger these warnings. Add a flag to suppress them to the numeric grok functions, which can be set by the code that is dealing with these lists
* Change inversion lists to SVs | Karl Williamson | 2011-07-03 | 1 | -2/+2
| | | | | | The inversion list is an opaque object, currently implemented as an SV. Even if it ends up being an HV in the future, Nicholas is of the opinion that it should be presented to the world as an SV*.
* utf8.c: revise comment | Karl Williamson | 2011-05-20 | 1 | -2/+3
|
* Fix some multi-char /i fold bugs | Karl Williamson | 2011-05-19 | 1 | -4/+147
| | | | | | | | | | | | | | | | | | | | | | Consider U+FB05 and U+FB06. These both fold to 'st', and hence should match each other under /i. However, Unicode doesn't furnish a rule for this, and Perl hasn't been smart enough to figure it out. The bug that shows up is in constructs like "\x{fb06}" =~ /[^\x{fb05}]/i succeeding. Most of these instances also have a 'S' entry in Unicode's CaseFolding.txt, which avoids the problem (as mktables was earlier changed to include those in the generated table). But there were several code points that didn't. This patch changes utf8.c to look for these when constructing its inverted list of case fold equivalents. An alternative would have been to change mktables instead to look for them and create synthetic rules. But, this is more general in case the function ends up being used for other things. I will change fold_grind.t to test for these in a separate commit.
* utf8.c: Remove soon-to-be-obsoleted comment | Karl Williamson | 2011-05-19 | 1 | -16/+0
| | | | | This comment will no longer apply, as the code it talked about is moving into swash_init().
* utf8.c: Remove unnecessary temporary | Karl Williamson | 2011-05-19 | 1 | -5/+2
|
* utf8.c: "<" should be "<=" | Karl Williamson | 2011-05-19 | 1 | -1/+1
| | | | | | | av_len() is misnamed, and hence led me earlier to stop the loop one shy of what it should have been. No actual bugs were caused by this, but it could cause a duplicate entry in an array, which is searched linearly, hence a slight slowdown.
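The off-by-one described above arises because av_len() returns the highest index of an array rather than its element count. A minimal C sketch of the pattern follows. The `int_array` type and the `top_index()` and `array_contains()` helpers are hypothetical stand-ins for illustration, not Perl's AV API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an array object; this is not Perl's AV. */
typedef struct {
    const int *items;
    ptrdiff_t  count;   /* number of elements */
} int_array;

/* Behaves like av_len(): returns the highest valid index, that is
 * count - 1 (and -1 for an empty array), not the number of elements. */
ptrdiff_t top_index(const int_array *a)
{
    assert(a != NULL);
    return a->count - 1;
}

/* Linear search. The loop bound must be "<=" because top_index()
 * yields an index; using "<" would skip the final element. */
int array_contains(const int_array *a, int wanted)
{
    ptrdiff_t i;
    for (i = 0; i <= top_index(a); i++) {
        if (a->items[i] == wanted)
            return 1;
    }
    return 0;
}
```

With a three-element array, `top_index()` returns 2, so a loop written with `<` would never examine the last slot. That is how a duplicate entry could slip into a linearly searched array.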
* utf8.c: Add _flags version of to_utf8_fold() | Karl Williamson | 2011-05-03 | 1 | -7/+12
| | | | | | | | | | And also to_uni_fold(). The flag allows retrieving either simple or full folds. The interface is subject to change, so these are marked experimental and their names begin with underscore. The old versions are turned into macros calling the new versions with the correct extra parameter.
* utf8.c: silence compiler warnings | David Mitchell | 2011-03-26 | 1 | -2/+2
| | | | | prefer foo("%s", fixedstr) over foo(fixedstr). One day someone might change fixedstr to include '%' characters.
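The preference stated above can be illustrated with standard C formatting functions. This is only a sketch: `format_fixed()` is a hypothetical helper, and `snprintf()` stands in for whichever Perl routine the commit actually touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copies fixedstr into buf through a printf-style
 * call. The format is the constant "%s", so fixedstr is treated purely
 * as data. */
int format_fixed(char *buf, size_t size, const char *fixedstr)
{
    assert(buf != NULL && size > 0 && fixedstr != NULL);

    /* Risky form (not used here): snprintf(buf, size, fixedstr)
     * would interpret any '%' inside fixedstr as a conversion. */
    return snprintf(buf, size, "%s", fixedstr);
}
```

Passing a string such as "100% done" is safe with this form, whereas using it directly as the format string would have undefined behavior.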
* foldEQ_utf8(): Move rare tests out of main stream | Karl Williamson | 2011-02-22 | 1 | -14/+12
| | | | | | | The code for handling locale can be moved entirely to the place where locale handling is done for the second string, as by that time we have processed the first string, and the second. Since we only succeed if both are atomic, single-bytes, we don't need to do the loop below.
* utf8.c: Fix setting wrong variable | Karl Williamson | 2011-02-19 | 1 | -1/+1
| | | | This doesn't appear to actually break anything.
* foldEQ_utf8(): Add locale handling | Karl Williamson | 2011-02-19 | 1 | -5/+66
| | | | | | A new flag is now passable to this function to indicate to use locale rules for code points below 256. Unicode rules are still used for above 255. Folds which cross that boundary are disallowed.
* Subclass utf8 warnings so can turn off individually | Karl Williamson | 2011-02-17 | 1 | -22/+31
|
* handy.h: isIDFIRST_utf8() changed to use XIDStart | Karl Williamson | 2011-02-17 | 1 | -0/+25
| | | | | | | | Previously this used a home-grown definition of an identifier start, stemming from a bug in some early Unicode versions. This led to some problems, fixed by #74022. But the home-grown solution did not track Unicode, and allowed for characters, like marks, to begin words when they shouldn't. This change brings this macro into compliance with Unicode going forward.
* foldEQ_utf8_flags: Add no-mixing ASCII option | Karl Williamson | 2011-02-14 | 1 | -2/+25
| | | | | If this option is set, any match that has a non-ASCII character that has an ASCII character in its fold will not match that fold.
* foldEQ_utf8: Add version with flags parameter | Karl Williamson | 2011-02-14 | 1 | -2/+2
| | | | | The parameter doesn't do anything yet. The old version becomes a macro calling the new version with 0 as the flags.
* Silence compile warnings before uni tables built | Karl Williamson | 2011-02-06 | 1 | -1/+11
| | | | | | | | | | The recent move of Unicode folding to the compilation phase caused spurious warnings during the miniperl build phase of Perl itself before the Unicode tables get built. Before the tables are built, Perl is unable to know about the Unicode semantics (it has ASCII/Latin1 hard-coded in), but was still trying to access the tables. Now, it checks and if the tables aren't present uses just the hard-coded ASCII/Latin1 semantics.
* Move ANYOF folding from regexec to regcomp | Karl Williamson | 2011-02-02 | 1 | -10/+5
| | | | | | | | | This is for security as well as performance. It allows Unicode properties to not be matched case sensitively. As a result the swash inversion hash is converted from having utf8 keys to numeric, code point, keys. It also for the first time fixes the bug where /i doesn't work when a code point that is not at the end of a range in a bracketed character class has a multi-character fold.
* _swash_inversion_hash Rmv X from embed, add const | Karl Williamson | 2011-02-02 | 1 | -1/+1
| | | | This shouldn't be called from XS code.
* Add initial inversion list object | Karl Williamson | 2011-02-02 | 1 | -0/+65
| | | | | | | | | | | | | | | Going forward the intent is to convert from swashes to the better-suited inversion list data structure. This adds rudimentary inversion lists that have only the functionality needed for 5.14. As a result, they are as much as possible static to one file. What's necessary for 5.14 is enough to allow folding of ANYOF nodes to be moved from regexec to regcomp. Why they are needed for that is to generate as compact as possible class definitions; otherwise, very long linear lists might be generated. (They still may be, but that's inherent in the problem domain; this generates as compact as possible, combining overlapping ranges, etc.) The only two non-trivial methods in this object are from published algorithms.
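For readers unfamiliar with the data structure named above, here is a minimal C sketch of an inversion list lookup. It assumes the usual textbook layout: a sorted array of code points in which element 0 begins a range that is in the set, element 1 begins a range that is not, and so on, alternating. The function name `invlist_contains()` and the plain `uint32_t` array are assumptions for illustration, not the object this commit added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup over a textbook inversion list.
 * list[] is sorted ascending; even-indexed elements start ranges that
 * are IN the set, odd-indexed elements start ranges that are OUT. */
int invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0;
    size_t hi = len;    /* search window is [lo, hi) */

    assert(list != NULL || len == 0);

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* lo elements are <= cp, so cp falls in the range begun by element
     * lo - 1. That index is even (an "in" range) exactly when lo is odd. */
    return (lo & 1) != 0;
}
```

For example, the list `{0x41, 0x5B, 0x61, 0x7B}` represents A-Z and a-z in four elements. Inverting the set only requires adding or removing a leading 0, which is one reason the structure suits compact class definitions.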
* newSVpvf_nocontext only visible with threads, fix for non-threaded | Tony Cook | 2011-01-10 | 1 | -4/+4
| | | | | | | | Ideally it would be available, calling Perl_newSVpvf_nocontext directly is an alternative, but the comment in sv.c makes that questionable. Since the function being called from already has a context, use it.
* utf8.c: Renumber cases in switch | Karl Williamson | 2011-01-09 | 1 | -3/+3
| | | | This tidies things up after several of them were removed.
* utf8.c: Change to warn_d in two places | Karl Williamson | 2011-01-09 | 1 | -2/+3
| | | | The routines that these call used the warn_d forms; so these should as well.
* Add warnings for use of problematic code points | Karl Williamson | 2011-01-09 | 1 | -0/+30
| | | | | | | | The non-Unicode code points have no Unicode semantics, so applying operations such as casing on them warns. This patch also includes the changes to test the warnings added by recent commits for handling the surrogates and above-Unicode code points
* utf8.c: Whitespace only | Karl Williamson | 2011-01-09 | 1 | -35/+35
| | | | outdent in response to the enclosing block being removed
* utf8.c(): Default to allow problematic code points | Karl Williamson | 2011-01-09 | 1 | -67/+163
| | | | | | | | | | | | | | | | | | | Surrogates, non-character code points, and code points that aren't in Unicode are now allowed by default, instead of having to specify a flag to allow them. (Most code did specify those flags anyway.) This affects uvuni_to_utf8_flags(), utf8n_to_uvuni() and various routines that are specialized interfaces to them. Now there is a new set of flags to disallow those code points. Further, all 66 of the non-character code points are known about and handled consistently, instead of just U+FFFF. Code that requires these code points to be forbidden will have to change to use the new flags. I have looked at all the (few) instances in CPAN where these routines are used, and the only one I found that appears to have need to do this, Encode, has already been patched to accommodate this change. Of course, I may have overlooked some subtleties.
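The figure of 66 non-character code points mentioned above follows from the Unicode definition: the 32 code points U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes. A small C sketch of that classification is given below. The function names are hypothetical and this is not the test utf8.c performs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classifier for Unicode non-characters. */
int is_unicode_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return 0;   /* above Unicode: a separate category */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 1;   /* the contiguous block of 32 */
    /* U+nFFFE and U+nFFFF for every plane n = 0..16 */
    return (cp & 0xFFFE) == 0xFFFE;
}

/* Counts the non-characters by brute force over all of Unicode. */
unsigned count_noncharacters(void)
{
    unsigned count = 0;
    uint32_t cp;
    for (cp = 0; cp <= 0x10FFFF; cp++)
        count += (unsigned)is_unicode_noncharacter(cp);
    assert(count == 32 + 17 * 2);
    return count;
}
```

Surrogates (U+D800..U+DFFF) and code points above U+10FFFF are deliberately not matched here, since the commit treats them as distinct classes of problematic code points.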
* utf8.c: Nits in pod | Karl Williamson | 2011-01-09 | 1 | -6/+5
|
* Add check_utf8_print() | Karl Williamson | 2011-01-09 | 1 | -0/+48
| | | | | | | This new function looks for problematic code points on output, and warns if any are found, returning FALSE as well. What it warns about may change, so is marked as experimental.
* In Perl_swash_init(), use call_sv() directly instead of call_method(). | Nicholas Clark | 2011-01-07 | 1 | -1/+1
| | | | | | This gives a small space saving on this platform, likely due to code being shared with the other call to call_sv(). (It also removes a level of function call at runtime.)
* In Perl_swash_init(), reuse any non-NULL return value from Perl_gv_fetchmeth(). | Nicholas Clark | 2011-01-07 | 1 | -2/+7
| | | | | | | | | Historically Perl_swash_init() called Perl_gv_fetchmeth() simply to determine if the requested package was loaded, and if not, attempt to load it. However, Perl_gv_fetchmeth() is actually making the same lookup as Perl_call_method() uses to get a pointer to the relevant method. Hence if we get a non-NULL return from Perl_gv_fetchmeth() we can pass it directly to Perl_call_sv(), and save duplicated work.
* utf8.c: add to comment | Karl Williamson | 2010-12-19 | 1 | -1/+2
|
* utf8.c, .h: Clarify pod and comment | Karl Williamson | 2010-12-19 | 1 | -2/+3
|
* Document use of strlen() by is_ascii_string(), is_utf8_string() and friends. | Marvin Humphrey | 2010-12-08 | 1 | -3/+6
|
* pp.c, utf8.c: Convert to use TWO_BYTE_UTF8_TO_UNI | Karl Williamson | 2010-11-22 | 1 | -4/+2
|
* Add Perl_bytes_cmp_utf8() to compare character sequences in different encodings | Nicholas Clark | 2010-11-11 | 1 | -0/+69
| | | | | | | | | | | | | | | | | | | | | | | Convert sv_eq_flags() and sv_cmp_flags() to use it. Previously, to compare two strings of characters, where one was in UTF-8, and one was not, you had to either: 1: Upgrade the second to UTF-8 2: Compare the resulting octet sequence 3: Free the temporary UTF-8 string or: 1: Attempt to downgrade the first to bytes. If it can't be, they aren't equal 2: Else compare the resulting octet sequence 3: Free the temporary byte string Which for the general case involves a malloc()/free() and at least two O(n) scans per comparison. Whereas this approach has no allocation, a single O(n) scan, which terminates as early as the best case for the second approach.
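The single-pass, allocation-free comparison described above can be sketched in C as follows. This is an illustration of the idea only: `latin1_cmp_utf8()` is a hypothetical name, the input on the UTF-8 side is assumed to be well-formed, and the return values (-1, 0, 1) are a simplification that may not match what Perl_bytes_cmp_utf8() returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: compares a byte string b (one character per
 * byte, code points 0..255) with a UTF-8 string u, character by
 * character. Returns -1, 0 or 1. Assumes u is well-formed UTF-8. */
int latin1_cmp_utf8(const unsigned char *b, size_t blen,
                    const unsigned char *u, size_t ulen)
{
    const unsigned char *bend = b + blen;
    const unsigned char *uend = u + ulen;

    assert(b != NULL && u != NULL);

    while (b < bend && u < uend) {
        unsigned int c = *u++;

        if (c >= 0x80) {
            /* Only lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF. */
            if ((c == 0xC2 || c == 0xC3) && u < uend)
                c = ((c & 0x1F) << 6) | (*u++ & 0x3Fu);
            else
                return -1;  /* code point above 0xFF: b sorts first */
        }
        if (*b != c)
            return *b < c ? -1 : 1;
        b++;
    }

    if (b < bend)
        return 1;   /* u is a proper prefix of b */
    if (u < uend)
        return -1;  /* b is a proper prefix of u */
    return 0;
}
```

The loop decodes at most one UTF-8 character per iteration and stops at the first difference, so no temporary upgraded or downgraded copy of either string is needed.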
* utf8.c: Add function to create inversion of swash | Karl Williamson | 2010-11-07 | 1 | -0/+132
| | | | | | | | | | | | This adds _swash_inversion_hash() which takes a mapping swash and returns a hash that is the inverse relation. That is, given a code point, it allows quick lookup of all code points that map to it. The function is not for public use, as it will likely be revised, so is not in the public API, and its name begins with underscore. It does not deal with multi-char mappings at this time, nor other swash complications.
* utf8.c: extract code into separate subroutine | Karl Williamson | 2010-11-07 | 1 | -72/+102
| | | | | | This patch moves the code that reads a single line from the main body of an input Unicode property table into a separate subroutine. This is in preparation for using it from another place
* utf8.c: Add comments | Karl Williamson | 2010-11-07 | 1 | -7/+17
| | | | I added comments as I was reading the code trying to understand it
* Avoid compiler warnings in Perl_foldEQ_utf8, spotted by Jerry D. Hedden. | Nicholas Clark | 2010-06-17 | 1 | -2/+6
|
* Change name of ibcmp to foldEQ | Karl Williamson | 2010-06-05 | 1 | -8/+8
| | | | | | | | | | | | | | | | As discussed on p5p, ibcmp has different semantics from other cmp functions in that it is a binary instead of ternary function. It is less confusing then to have a name that implies true/false. There are three functions affected: ibcmp, ibcmp_locale and ibcmp_utf8. ibcmp is actually equivalent to foldNE, but for the same reason that things like 'unless' and 'until' are cautioned against, I changed the functions to foldEQ, so that the existing names, like ibcmp_utf8 are defined as macros as being the complement of foldEQ. This patch also changes the one file where turning ibcmp into a macro causes problems. It changes it to use the new name. It also documents for the first time ibcmp, ibcmp_locale and their new names.
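The renaming scheme described above, in which the positive test becomes the real function and the old name becomes a macro for its complement, can be sketched in C like this. The sketch is ASCII-only and the names `fold_eq_ascii()` and `ibcmp_ascii()` are hypothetical. Perl's actual foldEQ and ibcmp definitions are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Maps ASCII upper case to lower case and leaves other bytes alone. */
static unsigned char fold_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical binary predicate: returns 1 if the first len bytes of
 * a and b are equal ignoring ASCII case, 0 otherwise. The name says
 * "EQ", so a true result means the strings match. */
int fold_eq_ascii(const char *a, const char *b, size_t len)
{
    size_t i;
    assert(a != NULL && b != NULL);
    for (i = 0; i < len; i++) {
        if (fold_ascii((unsigned char)a[i]) != fold_ascii((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* The legacy cmp-style name survives as the complement: it is nonzero
 * when the strings differ, like a comparison that returns 0 on match. */
#define ibcmp_ascii(a, b, len) (!fold_eq_ascii((a), (b), (len)))
```

Existing callers that test `ibcmp_ascii(...) == 0` keep working, while new code can call the positively named function directly.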
* utf8.c: further doc tweaks | Karl Williamson | 2010-06-05 | 1 | -6/+11
|
* utf8.c: Modify doc comment; change whitespace | Karl Williamson | 2010-06-05 | 1 | -75/+74
| | | | | | | This removes the comment about the function name, and converts tabs to blanks throughout the function, as so much of it is changing already. It also removes trailing whitespace in other lines of the file.
* Revamp ibcmp_utf8 for performance and clarity | Karl Williamson | 2010-06-05 | 1 | -108/+152
| | | | | | | | | | | | | | | | | | | | | | | | | | I had a hard time understanding how this routine worked; there were no comments. In figuring it out, I discovered it could be made more efficient. This routine is called over and over in the innermost loops in regex matching, so efficiency is a concern. Setup is done once before the main while loop so that it now has two conditions instead of eight. The loop was rearranged slightly to be smaller and a couple of unneeded assignments to temporaries were removed, and recomputation of some values was avoided. Several other small efficiency changes were made. Several asserts had been commented out, saying that they make tests fail. But they no longer do, at least on my platform. There was a reason that they were asserts to begin with, and that is they denoted an insane or trivial condition. Apparently there have been fixes to the other code calling this, so I re-enabled them. The names of several variables were changed to be less confusing; hence f1 means the fold buffer for string 1 whereas it used to mean its goal, which is now g1. The leading indent was changed from 5 to 4 blanks. I made enough other changes that I didn't submit this as a separate commit
* Clarify some documentation | Karl Williamson | 2010-06-05 | 1 | -3/+5
|
* PATCH: user defined special casing for non utf8 | Karl Williamson | 2010-05-26 | 1 | -2/+1
| | | | | | | | | | Users can define their own case changing mappings to replace the standard ones. Prior to this patch, any mappings on characters whose ordinals are 0-222, 224-255 that resulted in multiple characters were ignored. Note that there still is a deficiency in that the mappings will be applied only to strings in utf8 format.
* PATCH: [perl #72998] regex looping | Karl Williamson | 2010-04-15 | 1 | -1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | If a character folds to multiple ones in case-insensitive matching, it should not match just one of those, or the regular expression can loop. For example, \N{LATIN SMALL LIGATURE FF} folds to 'ff', and so "\N{LATIN SMALL LIGATURE FF}" =~ /f+/i should match. Prior to this patch, this function returned that there is a match, but left the matching string pointer at the beginning of the "\N{LATIN SMALL LIGATURE FF}" because it doesn't make sense to match just half a character, and at this level it doesn't know about the '+'. This leaves things in an inconsistent state, with the reporting of a match, but the input pointer unchanged, the result of which is a loop. I don't know how to fix this so that it correctly matches, and there are semantic issues with doing so. For example, if "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i matches, then one would think that so should "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i But $1 and $2 don't really make sense here, since they both refer to the half of the same character. So this patch just returns failure if only a partial character is matched. That leaves things consistent, and solves the problem of looping, so that Perl doesn't hang on such a construct, but leaves the ultimate solution for another day.
* [perl #73174] swash_init() wasn't saving %^H | David Mitchell | 2010-03-02 | 1 | -2/+1
|
* change non-char warning message from malformed | Karl Williamson | 2009-12-20 | 1 | -45/+49
|