path: root/utf8.c
Commit message | Author | Age | Files | Lines
...
* mro.c: Correct utf8 and bytes concatenation | Father Chrysostomos | 2011-10-06 | 1 | -0/+3
| The previous commit introduced some code that concatenates a pv on to an sv and then does SvUTF8_on on the sv if the pv was utf8. That can’t work if the sv was in Latin-1 (or single-byte) encoding and contained extra-ASCII characters. Nor can it work if bytes are appended to a utf8 sv. Both produce mangled utf8. There is apparently no function apart from sv_catsv that handles this. So I’ve modified sv_catpvn_flags to handle this if passed the SV_CATUTF8 (concatenating a utf8 pv) or SV_CATBYTES (concatenating a byte pv) flag. This avoids the overhead of creating a new sv (in fact, sv_catsv even copies its rhs in some cases, so that would mean creating two new svs). It might even be worthwhile to redefine sv_catsv in terms of this....
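The mangling described above comes from the fact that a Latin-1 byte at or above 0x80 is not valid UTF-8 by itself; it has to become a two-byte sequence before it can sit in a UTF-8 buffer. A minimal sketch of that conversion, assuming a plain byte buffer rather than a Perl SV (append_latin1_as_utf8 is a hypothetical helper, not part of the Perl API, and this is not the code from the commit):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Perl API: append `len` Latin-1 bytes from `src`
 * to a UTF-8 buffer `dst` that already holds `dst_len` bytes.
 * The caller must provide room for up to 2 * len extra bytes.
 * Returns the new length of the buffer contents. */
size_t append_latin1_as_utf8(unsigned char *dst, size_t dst_len,
                             const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            /* ASCII is encoded identically in Latin-1 and UTF-8. */
            dst[dst_len++] = c;
        } else {
            /* U+0080..U+00FF need a two-byte UTF-8 sequence. */
            dst[dst_len++] = (unsigned char)(0xC0 | (c >> 6));
            dst[dst_len++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return dst_len;
}
```

Simply copying the bytes and setting a "this is UTF-8" flag afterwards would leave the 0xE9 in "caf\xE9" as a lone, malformed byte, which is the failure the commit message describes.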
* utf8.c: Add function to retrieve new _Perl_IDStart prop | Karl Williamson | 2011-10-01 | 1 | -0/+10
|
* Comment-only nits | Karl Williamson | 2011-10-01 | 1 | -3/+4
|
* utf8.c: Remove (mostly) redundant test | Karl Williamson | 2011-10-01 | 1 | -4/+0
| The swashes already have the underscore, so this test is redundant. It does save some time for this character to avoid having to go out and load the swash, but why just the underscore? In fact an earlier commit changed the macro that most people should use to access this function to not even call it for the underscore.
* Don't use swash to find cntrls | Karl Williamson | 2011-10-01 | 1 | -4/+10
| Unicode stability policy guarantees that no code points will ever be added to the control characters beyond those already in it. All such characters are in the Latin1 range, and so the Perl core already knows which ones those are, and so there is no need to go out to disk and create a swash for these.
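Because the set of control characters is fixed and lies entirely below U+0100, membership reduces to two range comparisons. A minimal sketch, assuming "control" means Unicode General_Category Cc (is_cntrl_cp is a hypothetical name, not the function in utf8.c):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if `cp` has General_Category Cc.
 * The Cc characters are U+0000..U+001F (C0 controls) and
 * U+007F..U+009F (DEL plus the C1 controls). */
bool is_cntrl_cp(uint32_t cp)
{
    return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
}
```

No table lookup or disk access is involved, which is the saving the commit message points to.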
* utf8.c: Use less confusing property name | Karl Williamson | 2011-10-01 | 1 | -1/+1
| The XPerlSpace is less confusing than SpacePerl (at least to me). It means take PerlSpace and extend it beyond ASCII.
* No need for swashes for properties that are ASCII-only | Karl Williamson | 2011-10-01 | 1 | -3/+9
| These three properties are restricted to being true only for ASCII characters. That information is compiled into Perl, so no need to create swashes for them.
* No need for swashes for computing if ASCII | Karl Williamson | 2011-10-01 | 1 | -4/+4
| This information is trivially computed via the macro, no need to go out to disk and store a swash for this.
* utf8.c: Call new function invlist_invert_prop() | Karl Williamson | 2011-10-01 | 1 | -1/+1
| This new function is now potentially called. However, there is no data file or other circumstances which currently cause this path to get executed.
* Revise diagnostic text | Karl Williamson | 2011-10-01 | 1 | -1/+1
| I believe that the new wording is clearer than the older, which I wrote.
* utf8.c: White space only | Karl Williamson | 2011-10-01 | 1 | -5/+5
| This indents a block of code to match being in a newly created block.
* utf8.c: Don't invert beyond-Unicode code points | Karl Williamson | 2011-10-01 | 1 | -0/+14
| The Unicode properties are defined only on Unicode code points. In the past, this meant all property matches would fail for non-Unicode code points. However, starting with 5.15.1 some properties do succeed. This restores the previous behavior.
* [perl #99984] Incorrect errmsg with our $::é | Father Chrysostomos | 2011-10-01 | 1 | -0/+2
| Having PL_parser->error_count set to non-zero when utf8_heavy.pl tries to do() one of its swashes results in ‘Compilation error’ being placed in $@ during the do, even if it was successful. This patch sets the error_count to 0 before calling SWASHNEW, to prevent that. It uses SAVEI8, to make sure it is restored on scope exit.
* Convert some files from Latin-1 to UTF-8 | Keith Thompson | 2011-09-07 | 1 | -2/+2
|
* utf8.c: Accept INVERT_IT in swash | Karl Williamson | 2011-07-03 | 1 | -1/+17
| This allows a swash to return a list, along with an extra key in the hash which says that the list should be inverted. A future commit will generate such keys.
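For readers unfamiliar with why inverting a list is cheap: in the generic inversion-list representation (a sorted array in which each element toggles membership, starting from "not in the set"), complementing the set only requires adding or removing a leading 0. A minimal sketch of that general technique, assuming a plain uint32_t array; invlist_invert_sketch is a hypothetical name and this is not the Perl implementation, which stores the list in an SV:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: complement an inversion list in place.
 * `list` must have capacity for len + 1 elements.
 * Returns the new number of elements. */
size_t invlist_invert_sketch(uint32_t *list, size_t len)
{
    assert(list != NULL);
    if (len > 0 && list[0] == 0) {
        /* The set started at code point 0; dropping the leading 0
         * shifts every toggle by one and so flips membership everywhere. */
        memmove(list, list + 1, (len - 1) * sizeof *list);
        return len - 1;
    }
    /* Otherwise prepend 0 so that membership starts "on". */
    memmove(list + 1, list, len * sizeof *list);
    list[0] = 0;
    return len + 1;
}
```

Inverting twice returns the original list, which makes a deferred "invert this" marker on a list a natural fit.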
* utf8.c: swash_to_invlist() handle EXTRAS | Karl Williamson | 2011-07-03 | 1 | -1/+78
| This function has not been able to handle what are called EXTRAS in its input. These are things like: !utf8::InHiragana, -utf8::InKatakana, +utf8::IsCn, besides the normal list of ranges. This commit allows this function to handle all the same constructs as the regular swash input function, from which most of the new code was copied.
* Add flag to num groks to silence non-portable warnings | Karl Williamson | 2011-07-03 | 1 | -4/+9
| Unicode inversion lists commonly will contain UV_MAX, which may trigger these warnings. Add a flag to suppress them to the numeric grok functions, which can be set by the code that is dealing with these lists.
* Change inversion lists to SVs | Karl Williamson | 2011-07-03 | 1 | -2/+2
| The inversion list is an opaque object, currently implemented as an SV. Even if it ends up being an HV in the future, Nicholas is of the opinion that it should be presented to the world as an SV*.
* utf8.c: revise comment | Karl Williamson | 2011-05-20 | 1 | -2/+3
|
* Fix some multi-char /i fold bugs | Karl Williamson | 2011-05-19 | 1 | -4/+147
| Consider U+FB05 and U+FB06. These both fold to 'st', and hence should match each other under /i. However, Unicode doesn't furnish a rule for this, and Perl hasn't been smart enough to figure it out. The bug that shows up is in constructs like "\x{fb06}" =~ /[^\x{fb05}]/i succeeding. Most of these instances also have a 'S' entry in Unicode's CaseFolding.txt, which avoids the problem (as mktables was earlier changed to include those in the generated table). But there were several code points that didn't. This patch changes utf8.c to look for these when constructing its inverted list of case fold equivalents. An alternative would have been to change mktables instead to look for them and create synthetic rules. But, this is more general in case the function ends up being used for other things. I will change fold_grind.t to test for these in a separate commit.
* utf8.c: Remove soon-to-be-obsoleted comment | Karl Williamson | 2011-05-19 | 1 | -16/+0
| This comment will no longer apply, as the code it talked about is moving into swash_init().
* utf8.c: Remove unnecessary temporary | Karl Williamson | 2011-05-19 | 1 | -5/+2
|
* utf8.c: "<" should be "<=" | Karl Williamson | 2011-05-19 | 1 | -1/+1
| av_len() is misnamed, and hence led me earlier to stop the loop one shy of what it should have been. No actual bugs were caused by this, but it could cause a duplicate entry in an array, which is searched linearly, hence a slight slowdown.
* utf8.c: Add _flags version of to_utf8_fold() | Karl Williamson | 2011-05-03 | 1 | -7/+12
| And also to_uni_fold(). The flag allows retrieving either simple or full folds. The interface is subject to change, so these are marked experimental and their names begin with underscore. The old versions are turned into macros calling the new versions with the correct extra parameter.
* utf8.c: silence compiler warnings | David Mitchell | 2011-03-26 | 1 | -2/+2
| Prefer foo("%s", fixedstr) over foo(fixedstr). One day someone might change fixedstr to include '%' characters.
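The same point applies to any printf-style function: a string passed as the format argument is interpreted, while one passed as a "%s" argument is copied verbatim. A minimal sketch using snprintf from the C standard library (format_fixed is a hypothetical wrapper standing in for the foo of the commit message):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical wrapper: copy `fixedstr` into `out` without ever
 * treating it as a format string. Returns what snprintf returns,
 * i.e. the number of characters that the full string requires. */
int format_fixed(char *out, size_t out_size, const char *fixedstr)
{
    assert(out != NULL && fixedstr != NULL);
    /* Safe even if fixedstr later gains '%' characters. */
    return snprintf(out, out_size, "%s", fixedstr);
}
```

Had the call been snprintf(out, out_size, fixedstr), a string such as "100%s done" would make the function look for a nonexistent argument, which is undefined behavior.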
* foldEQ_utf8(): Move rare tests out of main stream | Karl Williamson | 2011-02-22 | 1 | -14/+12
| The code for handling locale can be moved entirely to the place where locale handling is done for the second string, as by that time we have processed the first string, and the second. Since we only succeed if both are atomic, single-bytes, we don't need to do the loop below.
* utf8.c: Fix setting wrong variable | Karl Williamson | 2011-02-19 | 1 | -1/+1
| This doesn't appear to actually break anything.
* foldEQ_utf8(): Add locale handling | Karl Williamson | 2011-02-19 | 1 | -5/+66
| A new flag is now passable to this function to indicate to use locale rules for code points below 256. Unicode rules are still used for above 255. Folds which cross that boundary are disallowed.
* Subclass utf8 warnings so can turn off individually | Karl Williamson | 2011-02-17 | 1 | -22/+31
|
* handy.h: isIDFIRST_utf8() changed to use XIDStart | Karl Williamson | 2011-02-17 | 1 | -0/+25
| Previously this used a home-grown definition of an identifier start, stemming from a bug in some early Unicode versions. This led to some problems, fixed by #74022. But the home-grown solution did not track Unicode, and allowed for characters, like marks, to begin words when they shouldn't. This change brings this macro into compliance with Unicode going-forward.
* foldEQ_utf8_flags: Add no-mixing ASCII option | Karl Williamson | 2011-02-14 | 1 | -2/+25
| If this option is set, any match that has a non-ASCII character that has an ASCII character in its fold will not match that fold.
* foldEQ_utf8: Add version with flags parameter | Karl Williamson | 2011-02-14 | 1 | -2/+2
| The parameter doesn't do anything yet. The old version becomes a macro calling the new version with 0 as the flags.
* Silence compile warnings before uni tables built | Karl Williamson | 2011-02-06 | 1 | -1/+11
| The recent move of Unicode folding to the compilation phase caused spurious warnings during the miniperl build phase of Perl itself before the Unicode tables get built. Before the tables are built, Perl is unable to know about the Unicode semantics (it has ASCII/Latin1 hard-coded in), but was still trying to access the tables. Now, it checks and if the tables aren't present uses just the hard-coded ASCII/Latin1 semantics.
* Move ANYOF folding from regexec to regcomp | Karl Williamson | 2011-02-02 | 1 | -10/+5
| This is for security as well as performance. It allows Unicode properties to not be matched case sensitively. As a result the swash inversion hash is converted from having utf8 keys to numeric, code point, keys. It also for the first time fixes the bug where /i doesn't work for a code point that is not at the end of a range in a bracketed character class and has a multi-character fold.
* _swash_inversion_hash Rmv X from embed, add const | Karl Williamson | 2011-02-02 | 1 | -1/+1
| This shouldn't be called from XS code.
* Add initial inversion list object | Karl Williamson | 2011-02-02 | 1 | -0/+65
| Going forward the intent is to convert from swashes to the better-suited inversion list data structure. This adds rudimentary inversion lists that have only the functionality needed for 5.14. As a result, they are as much as possible static to one file. What's necessary for 5.14 is enough to allow folding of ANYOF nodes to be moved from regexec to regcomp. Why they are needed for that is to generate as compact as possible class definitions; otherwise, very long linear lists might be generated. (They still may be, but that's inherent in the problem domain; this generates as compact as possible, combining overlapping ranges, etc.) The only two non-trivial methods in this object are from published algorithms.
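To illustrate why the structure is compact: an inversion list stores only the code points at which membership flips, so a range of any size costs two elements, and lookup is a binary search. A minimal sketch of the general data structure, assuming a sorted uint32_t array (invlist_contains_sketch is a hypothetical name; the version added in this commit is static to the Perl source and differs in detail):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: membership test on an inversion list.
 * list[0] starts the first included range, list[1] is the first code
 * point after it, list[2] starts the next range, and so on. */
bool invlist_contains_sketch(const uint32_t *list, size_t len, uint32_t cp)
{
    assert(list != NULL || len == 0);
    size_t lo = 0, hi = len;
    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* An odd count means an odd number of toggles: cp is in the set. */
    return (lo & 1) != 0;
}
```

For example, the four-element list {0x30, 0x3A, 0x41, 0x5B} describes the ASCII digits and uppercase letters, 36 code points in total.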
* newSVpvf_nocontext only visible with threads, fix for non-threaded | Tony Cook | 2011-01-10 | 1 | -4/+4
| Ideally it would be available, calling Perl_newSVpvf_nocontext directly is an alternative, but the comment in sv.c makes that questionable. Since the function being called from already has a context, use it.
* utf8.c: Renumber cases in switch | Karl Williamson | 2011-01-09 | 1 | -3/+3
| This tidies things up after several of them were removed.
* utf8.c: Change to warn_d in two places | Karl Williamson | 2011-01-09 | 1 | -2/+3
| The routines that these call used the warn_d forms; so these should as well.
* Add warnings for use of problematic code points | Karl Williamson | 2011-01-09 | 1 | -0/+30
| The non-Unicode code points have no Unicode semantics, so applying operations such as casing on them warns. This patch also includes the changes to test the warnings added by recent commits for handling the surrogates and above-Unicode code points.
* utf8.c: Whitespace only | Karl Williamson | 2011-01-09 | 1 | -35/+35
| Outdent in response to the enclosing block being removed.
* utf8.c(): Default to allow problematic code points | Karl Williamson | 2011-01-09 | 1 | -67/+163
| Surrogates, non-character code points, and code points that aren't in Unicode are now allowed by default, instead of having to specify a flag to allow them. (Most code did specify those flags anyway.) This affects uvuni_to_utf8_flags(), utf8n_to_uvuni() and various routines that are specialized interfaces to them. Now there is a new set of flags to disallow those code points. Further, all 66 of the non-character code points are known about and handled consistently, instead of just U+FFFF. Code that requires these code points to be forbidden will have to change to use the new flags. I have looked at all the (few) instances in CPAN where these routines are used, and the only one I found that appears to have need to do this, Encode, has already been patched to accommodate this change. Of course, I may have overlooked some subtleties.
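The three classes of problematic code point named above can each be recognized with a few comparisons. A minimal sketch based on the Unicode definitions, not on the Perl macros or flags (all three function names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: UTF-16 surrogates, U+D800..U+DFFF. */
bool is_surrogate_cp(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Hypothetical helper: anything past the last Unicode code point. */
bool is_above_unicode_cp(uint32_t cp)
{
    return cp > 0x10FFFF;
}

/* Hypothetical helper: the 66 non-characters, i.e. U+FDD0..U+FDEF
 * (32 code points) plus the last two code points of each of the
 * 17 planes, U+xxFFFE and U+xxFFFF (34 code points). */
bool is_unicode_nonchar(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}
```

Counting the code points for which is_unicode_nonchar returns true over the whole Unicode range gives 32 + 34 = 66, matching the figure in the commit message.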
* utf8.c: Nits in pod | Karl Williamson | 2011-01-09 | 1 | -6/+5
|
* Add check_utf8_print() | Karl Williamson | 2011-01-09 | 1 | -0/+48
| This new function looks for problematic code points on output, and warns if any are found, returning FALSE as well. What it warns about may change, so is marked as experimental.
* In Perl_swash_init(), use call_sv() directly instead of call_method(). | Nicholas Clark | 2011-01-07 | 1 | -1/+1
| This gives a small space saving on this platform, likely due to code being shared with the other call to call_sv(). (It also removes a level of function call at runtime.)
* In Perl_swash_init(), reuse any non-NULL return value from Perl_gv_fetchmeth(). | Nicholas Clark | 2011-01-07 | 1 | -2/+7
| Historically Perl_swash_init() called Perl_gv_fetchmeth() simply to determine if the requested package was loaded, and if not, attempt to load it. However, Perl_gv_fetchmeth() is actually making the same lookup as Perl_call_method() uses to get a pointer to the relevant method. Hence if we get a non-NULL return from Perl_gv_fetchmeth() we can pass it directly to Perl_call_sv(), and save duplicated work.
* utf8.c: add to comment | Karl Williamson | 2010-12-19 | 1 | -1/+2
|
* utf8.c, .h: Clarify pod and comment | Karl Williamson | 2010-12-19 | 1 | -2/+3
|
* Document use of strlen() by is_ascii_string(), is_utf8_string() and friends. | Marvin Humphrey | 2010-12-08 | 1 | -3/+6
|
* pp.c, utf8.c: Convert to use TWO_BYTE_UTF8_TO_UNI | Karl Williamson | 2010-11-22 | 1 | -4/+2
|