path: root/utf8.c
Commit message  Author  Age  Files  Lines
* utf8.c: White space only.  Karl Williamson  2013-05-20  1  -4/+4
| | | | Indent in newly formed block
* Fix multi-char fold edge case  Karl Williamson  2013-05-20  1  -13/+48
| | | | | | | | | | | | | | | | | | | | | | | | | use locale; fc("\N{LATIN CAPITAL LETTER SHARP S}") eq 2 x fc("\N{LATIN SMALL LETTER LONG S}") should return true, as the SHARP S folds to two 's's in a row, and the LONG S is an antique variant of 's', and folds to s. Until this commit, the expression was false. Similarly, the following should match, but didn't until this commit: "\N{LATIN SMALL LETTER SHARP S}" =~ /\N{LATIN SMALL LETTER LONG S}{2}/iaa The reason these didn't work properly is that in both cases the actual fold to 's' is disallowed. In the first case because of locale; and in the second because of /aa. And the code wasn't smart enough to realize that these were legal. The fix is to special case these so that the fold of sharp s (both capital and small) is two LONG S's under /aa; as is the fold of the capital sharp s under locale. The latter is user-visible, and the documentation of fc() now points that out. I believe this is such an edge case that no mention of it need be done in perldelta.
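The special case described above can be sketched in a few lines of standalone C. This is only an illustration of the rule as the commit message states it, not the code in utf8.c: the `SkFoldMode` enum and `sk_fold_sharp_s()` are hypothetical names, and the fold of the small sharp s under locale is deliberately left unmodelled because the message does not describe it.

```c
#include <stddef.h>

/* Hypothetical modes for this sketch; these are not Perl's flag names. */
typedef enum {
    SK_MODE_PLAIN,   /* ordinary Unicode full case folding */
    SK_MODE_LOCALE,  /* "use locale" in effect */
    SK_MODE_AA       /* /aa: a fold may not mix ASCII and non-ASCII */
} SkFoldMode;

#define SK_SMALL_SHARP_S   0x00DFUL  /* LATIN SMALL LETTER SHARP S */
#define SK_CAPITAL_SHARP_S 0x1E9EUL  /* LATIN CAPITAL LETTER SHARP S */

/* U+017F LATIN SMALL LETTER LONG S is C5 BF in UTF-8; this is two of them. */
static const char SK_TWO_LONG_S[] = "\xC5\xBF\xC5\xBF";

/* Returns the fold of a sharp s as a NUL-terminated UTF-8 string.
 * Returns NULL for any other code point, and for the one case this
 * sketch does not model (small sharp s under locale). */
static const char *sk_fold_sharp_s(unsigned long cp, SkFoldMode mode)
{
    if (cp != SK_SMALL_SHARP_S && cp != SK_CAPITAL_SHARP_S)
        return NULL;

    switch (mode) {
    case SK_MODE_AA:
        /* Folding to ASCII "ss" is disallowed, so use two LONG S's. */
        return SK_TWO_LONG_S;
    case SK_MODE_LOCALE:
        /* Per the commit message, only the capital sharp s gets the
         * two-LONG-S fold under locale. */
        return cp == SK_CAPITAL_SHARP_S ? SK_TWO_LONG_S : NULL;
    default:
        return "ss";
    }
}
```

Because LONG S itself folds to 's' where that fold is allowed, comparing against two LONG S's gives the same answer as comparing against "ss" without ever producing the disallowed ASCII fold.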
* Expand flags parameter from boolean in _to_fold_latin1  Karl Williamson  2013-05-20  1  -10/+9
| | | | This will be used in future commits to pass more flags.
* utf8.c: Replace two macro calls with equiv single  Karl Williamson  2013-05-20  1  -12/+6
| | | | UTF8_IS_ABOVE_LATIN1() is equivalent to (! UTF8_IS_INVARIANT && ! UTF8_IS_DOWNGRADEABLE_START), so we can use it alone, for clearer code with fewer branches.
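The equivalence can be checked exhaustively with a small standalone sketch. The `SK_` macros below are assumed ASCII-platform definitions written for this example, not copies of Perl's headers. The sketch also assumes the byte under test is the first byte of a well-formed character; the commit message does not state that restriction, but continuation bytes and the overlong starts 0xC0 and 0xC1 would otherwise break the equivalence.

```c
/* Assumed definitions for an ASCII platform (sketch only). */
#define SK_IS_INVARIANT(c)           ((unsigned char)(c) < 0x80)
#define SK_IS_DOWNGRADEABLE_START(c) (((unsigned char)(c) & 0xFE) == 0xC2)
#define SK_IS_ABOVE_LATIN1(c)        ((unsigned char)(c) >= 0xC4)

/* Returns 1 if the two-test form and the single-test form agree for
 * every byte value outside the skipped range, 0 otherwise. */
static int sk_forms_agree(void)
{
    int b;

    for (b = 0x00; b <= 0xFF; b++) {
        int two_tests, one_test;

        /* Skip bytes that never begin a well-formed character:
         * continuation bytes 0x80..0xBF and overlong starts 0xC0, 0xC1. */
        if (b >= 0x80 && b <= 0xC1)
            continue;

        two_tests = !SK_IS_INVARIANT(b) && !SK_IS_DOWNGRADEABLE_START(b);
        one_test  = SK_IS_ABOVE_LATIN1(b);
        if (two_tests != one_test)
            return 0;
    }
    return 1;
}
```

Under these definitions the single comparison replaces two tests and one logical AND, which is where the "fewer branches" comes from.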
* autodoc.pl: Add note for deprecated functions  Karl Williamson  2013-05-20  1  -6/+0
| | | | | This causes each deprecated function to have a prominent note to that effect in its API documentation.
* Use new case changing macros  Karl Williamson  2013-05-20  1  -5/+4
| | | | | The previous commit added macros to do some case changing. This commit uses them in the core, where appropriate.
* perlapi: Add docs for some case-changing macros; clarify others  Karl Williamson  2013-05-20  1  -37/+4
| | | | The case changing macros are now almost all documented. The exception is toUPPER_LC, which may change in 5.19. In addition, the functions in utf8.c that these macros call now refer to them instead of having their own documentation. People should really be using the macros instead of calling the functions directly. I'm not deprecating the functions because I can't foresee the need to change them, so code that uses them should continue to be ok.
* utf8.c: Remove redundant assignment.  Karl Williamson  2013-05-20  1  -1/+0
| | | | This variable is always set just below.
* utf8.c: Use mnemonics instead of hex numbers  Karl Williamson  2013-05-20  1  -4/+9
|
* S_* functions should be STATIC  Jan Dubois  2013-04-04  1  -2/+4
|
* Perl_sv_uni_display() needs to be aware of RX_WRAPPED()  Nicholas Clark  2013-03-19  1  -1/+4
| | | | | | | Commit 8d919b0a35f2b57a changed the storage location of the string in SVt_REGEXP. It updated most code to deal with this, but missed the use of SvPVX_const() in Perl_sv_uni_display(). This breaks dumping regular expressions which have the UTF-8 flag set.
* Add, fix comments  Karl Williamson  2013-02-25  1  -5/+10
|
* utf8.c: Reword a warning message  Karl Williamson  2013-01-16  1  -1/+1
| | | | This follows the suggestion by Aristotle Pagaltzis.
* handy.h: Add full complement of isIDCONT() macros  Karl Williamson  2012-12-23  1  -0/+19
| | | | | | | This also changes isIDCONT_utf8() to use the Perl definition, which excludes any \W characters (the Unicode definition includes a few of these). Tests are also added. These macros remain undocumented for now.
* Deprecate calling isFOO_utf8() with malformed  Karl Williamson  2012-12-23  1  -3/+13
| | | | handy.h has character classification macros to determine if a UTF-8 encoded character is of a given type FOO, such as isALPHA_utf8(), etc. Code that calls these should have first made sure that the parameter is legal UTF-8. Prior to this patch, false was silently returned for all illegal UTF-8. Now, in most instances, a deprecation warning is raised. This is to catch bugs, and prepare for eventual elimination of this check, which fails to catch read-off-end-of-buffer malformations anyway. (One idea would be to leave the check in for DEBUGGING builds.) The cases where no deprecation warning is raised as a result of this commit are the classes where the character does not have to be converted to a code point for its inclusion to be determined. For example, if malformed UTF-8 is checked to see if it is ASCII, we only need to check that it is one of the 128 ASCII characters. If it isn't, we don't bother to see if it is malformed or not. There are other cases, as well, such as with isSPACE(), where we check if the UTF-8 is one of a very finite set, without checking for malformedness. This commit causes a number of apparent bugs to be shown by the Perl test suite. These do not cause actual failures.
* Create internal _is_utf8_mark()  Karl Williamson  2012-12-22  1  -1/+12
| | | | | | | | | | | This is so we can deprecate non-core use of the existing one in a future commit. XS coders should be using the macros in handy.h instead of calling such functions directly. A future commit will deprecate all of them, but first the core uses of this one must change so they don't generate deprecation messages. I will not have a chance to look for some time, but I suspect that most uses of this function in the core should be changed to use something else, but in the meantime, the non-core uses can be deprecated.
* Remove temporary back-compat PL_ variable names  Karl Williamson  2012-12-22  1  -9/+9
| | | | These names are synonyms for specific array elements, and were used temporarily until all uses of them were removed. This commit removes the remaining uses, and the definitions.
* utf8.c: Remove two internal now unused functions.  Karl Williamson  2012-12-22  1  -20/+0
| | | | | These functions were used internally as helpers for matching \X in regular expressions. They are no longer used.
* Add generic _is_(uni|utf8)_FOO() function  Karl Williamson  2012-12-22  1  -18/+38
| | | | This function uses table lookup to replace 9 more specific functions, which can be deprecated. They should not have been exposed to the public API in the first place.
* handy.h: Create isALPHANUMERIC() and kin  Karl Williamson  2012-12-22  1  -1/+1
| | | | Perl has had an undocumented macro isALNUMC() for a long time. I want to document it, but the name is very obscure. Neither Yves nor I are sure what the name means. My best guess is "C's alnum". It corresponds to /[[:alnum:]]/, and so its best name would be isALNUM(). But that is the name long given to what matches \w. A new synonym, isWORDCHAR(), has been in place for several releases for that, but the old isALNUM() should remain for backwards compatibility. I don't think that the name isALNUMC() should be published, as it is too close to isALNUM(). I finally came to the conclusion that isALPHANUMERIC() is the best name; it describes its purpose clearly; the disadvantage is its long length. I doubt that it will get much use, but we need something, I think, that we can publish to accomplish this functionality. This commit also converts core uses of isALNUMC to isALPHANUMERIC. (I intended to do that separately, but made a mistake in rebasing, and combined the two patches; and it seemed like not a big enough problem to separate them out again.)
* perlapi: Grammar nit  Karl Williamson  2012-12-22  1  -5/+6
|
* utf8.c: Fix reference count in swash_to_invlist()  Karl Williamson  2012-12-22  1  -2/+3
| | | | | | | | The return SV* from this function was inconsistent in its reference count. In some cases it creates a new SV, which has a reference count of 1, and in some cases it returned an existing SV without incrementing the reference count. If the caller thought it was getting its own copy, and decremented the reference count, it could lead to a double free.
* Deprecate some functions in utf8.c  Karl Williamson  2012-12-09  1  -9/+18
| | | | | These functions are not used by the Perl core. Code should be using the equivalent macros in handy.h that may avoid a function call.
* utf8.c: Add locale support to functions  Karl Williamson  2012-12-09  1  -18/+56
| | | | | | | These functions were marked as XXX to add locale support. It was a simple matter to do. We support locales for code points under 256, so just call the appropriate macro for those, returning the Unicode interpretation for those over 255.
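The dispatch described above (locale rules below 256, Unicode rules above 255) might look like the following standalone sketch. `sk_unicode_is_upper()` is a hypothetical stand-in for a real Unicode property lookup and only knows the Greek capital letters; `sk_is_upper_lc()` is likewise an illustrative name, not one of the functions changed by the commit.

```c
#include <ctype.h>

/* Hypothetical stand-in for a Unicode table lookup. For this sketch,
 * only GREEK CAPITAL LETTER ALPHA..OMEGA (U+0391..U+03A9, skipping the
 * unassigned U+03A2) count as uppercase. */
static int sk_unicode_is_upper(unsigned long cp)
{
    return cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2;
}

/* Locale-aware classification: the current C locale decides for code
 * points under 256, the Unicode interpretation for everything else. */
static int sk_is_upper_lc(unsigned long cp)
{
    if (cp < 256)
        return isupper((int)cp) != 0;  /* argument fits in unsigned char */
    return sk_unicode_is_upper(cp);
}
```

In the default "C" locale, `sk_is_upper_lc('A')` is true and `sk_is_upper_lc(0x03B1)` (Greek small alpha) is false; results for 128..255 depend on whichever locale the program has selected.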
* utf8.c: Change is_uni_idfirst_lc() to use Perl's defn  Karl Williamson  2012-12-09  1  -1/+1
| | | | The Perl definition is slightly more restrictive than Unicode's idfirst. We should use our definition consistently.
* Add functions for getting ctype ALNUMC  Karl Williamson  2012-12-09  1  -0/+27
| | | | We think this is meant to stand for C's alphanumeric, that is, what is matched by POSIX [:alnum:]. There were no functions or dedicated swash available for accessing it. Future commits will want to use these.
* utf8.c: Combine 2 function calls into one  Karl Williamson  2012-12-09  1  -2/+1
| | | | There is a function that does both these together, more efficiently
* utf8.c: Move ARGS_ASSERT to earlier in function  Karl Williamson  2012-12-09  1  -2/+2
| | | | to a place where people more expect to see it.
* Make isIDFIRST_uni() return identically as isIDFIRST_utf8()  Karl Williamson  2012-11-29  1  -0/+8
| | | | | | | These two macros should have the same results for the same input code points. Prior to this patch, the _uni() macro returned the official Unicode ID_Start property, and the _utf8() macro returned Perl's slightly restricted definition. Now both return Perl's.
* Remove double underscore in internal function name  Karl Williamson  2012-11-29  1  -2/+2
| | | | | | This function was added in 5.16, and has no callers in CPAN. It is undocumented and marked as changeable. Its name has two underscores in a row by mistake. This removes one of them.
* New COW mechanism  Father Chrysostomos  2012-11-27  1  -10/+17
| | | | This was discussed in ticket #114820. This new copy-on-write mechanism stores a reference count for the PV inside the PV itself, at the very end. (I was using SvEND+1 at first, but parts of the regexp engine expect to be able to do SvCUR_set(sv,0), which causes the wrong byte of the string to be used as the reference count.) Only 256 SVs can share the same PV this way. Also, only strings with allocated space after the trailing null can be used for copy-on-write. Much of the code is shared with PERL_OLD_COPY_ON_WRITE. The restriction against doing copy-on-write with magical variables has hence been inherited, though it is not necessary. A future commit will take care of that. I had to modify _core_swash_init to handle $@ differently. The existing mechanism of copying $@ to a new scalar and back again was very fragile. With copy-on-write, $@ =~ s/// can cause pp_subst’s string pointers to become stale. So now we remove the scalar from *@ and allow the utf8-table-loading code to autovivify a new one. Then we restore the untouched $@ afterwards if all goes well.
* Remove "register" declarations  Karl Williamson  2012-11-24  1  -2/+2
| | | | This finishes the removal of register declarations started by eb578fdb5569b91c28466a4d1939e381ff6ceaf4. That commit neglected the ones in function parameter declarations, and didn't include things in dist, ext, and lib, which this one does include.
* Request is_utf8_char_slow() be inlined  Karl Williamson  2012-11-24  1  -1/+1
|
* prevent multiple evaluations of ERRSV  Daniel Dragan  2012-11-23  1  -4/+10
| | | | Remove a large amount of machine code (~4KB for me) from funcs that use ERRSV, making Perl faster and smaller by preventing multiple evaluation.
| | | | ERRSV is a macro that contains GvSVn which eventually conditionally calls Perl_gv_add_by_type. If a SvTRUE or any other multiple evaluation macro is used on ERRSV, the expansion will, in asm, have dozens of calls to Perl_gv_add_by_type, one for each test/deref of the SV in SvTRUE. A less severe problem exists when multiple funcs (sv_set*) in a row call, each with ERRSV as an arg. It's recalculated then, Perl_gv_add_by_type and all.
| | | | I think ERRSV macro got the func call in commit f5fa9033b8, Perl RT #70862. Prior to that commit it would be pure derefs I think. Saving the SV* is still better than looking into interp->gv->gp to get the SV * after each func call.
| | | | I received no responses to http://www.nntp.perl.org/group/perl.perl5.porters/2012/11/msg195724.html explaining when the SV is replaced in PL_errgv, so took a conservative view and assumed callbacks (with Perl stack/ENTER/LEAVE/eval_*/call_*) can change it. I also assume ERRSV will never return null; this allows a more efficient version of SvTRUE to be used.
| | | | In Perl_newATTRSUB_flags a wasteful copy to C stack operation with the string was removed, and a croak_notcontext to remove push instructions to the stack. I was not sure about the interaction between ERRSV and message sv, I didn't change it to a more efficient (instruction wise, speed, idk) format string combining of the not safe string and ERRSV in the croak call. If such an optimization is done, a compiler potentially will put the not safe string on the first, unconditionally, then check PL_in_eval, and then jump to the croak call site, or eval ERRSV, push the SV on the C stack then push the format string "%"SVf"%s". The C stack allocated const char array came from commit e1ec3a884f.
| | | | In Perl_eval_pv, croak_on_error was checked first to not eval ERRSV unless necessary. I was not sure about the side effects of using a more efficient croak_sv instead of Perl_croak (null chars, utf8, etc) so I left a comment. nocontext used to save a push instruction on implicit sys perl.
| | | | In S_doeval, don't open a new block to avoid large whitespace changes. The NULL assignment should optimize away unless accidental usage of errsv in the future happens through a code change. There might be a bug here from commit ecad31f018 since previously a char * was derefed to check for null char, but ERRSV will never be null, so "Unknown error\n" branch will never be taken.
| | | | For pp_sys.c, in pp_die a new block was opened to not eval ERRSV if "well-formed exception supplied". The else if else if else blocks all used ERRSV, so a "SV * errsv = NULL;" and a eval in the conditional with comma op thing wouldn't work (maybe it would, see toke.c comments later in this message). pp_warn, I have no comments.
| | | | In S_compile_runtime_code, a croak_sv question comes up same as in Perl_eval_pv. In S_new_constant, a eval in the conditional is done to avoid evaling ERRSV if PL_in_eval short circuits. Same thing in Perl_yyerror_pvn. Perl__core_swash_init I have no comments.
| | | | In the future, a SvEMPTYSTRING macro should be considered (not fully thought out by me) to replace the SvTRUEs with something smaller and faster when dealing with ERRSV. _nomg is another thing to think about. In S_init_main_stash there is an opportunity to prevent an extra ERRSV between "sv_grow(ERRSV, 240);" and "CLEAR_ERRSV();" that was too complicated for me to optimize.
| | | | before perl517.dll .text 0xc2f77 .rdata 0x212dc .data 0x3948
| | | | after perl517.dll .text 0xc20d7 .rdata 0x212dc .data 0x3948
| | | | Numbers are from VC 2003 x86 32 bit.
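The multiple-evaluation problem is easy to reproduce outside Perl. The following standalone sketch uses invented names throughout: `SK_ERRVAR` plays the role of a macro that hides a function call, and `SK_IS_SMALL_POSITIVE()` plays the role of a macro such as SvTRUE that mentions its argument more than once. It shows the general pattern the commit applies (evaluate once, keep the pointer in a local), not the actual Perl code.

```c
static int sk_lookup_calls = 0;  /* counts how often the lookup runs */
static int sk_value = 7;

/* Stand-in for a lookup that is not free, such as one that may have to
 * find or create the variable it returns. */
static int *sk_lookup(void)
{
    sk_lookup_calls++;
    return &sk_value;
}

/* A macro that hides a function call behind a variable-like name. */
#define SK_ERRVAR (sk_lookup())

/* A macro that evaluates its argument twice. */
#define SK_IS_SMALL_POSITIVE(p) (*(p) > 0 && *(p) < 100)

/* Passing SK_ERRVAR directly expands to two calls of sk_lookup(). */
static int sk_check_repeated(void)
{
    return SK_IS_SMALL_POSITIVE(SK_ERRVAR);
}

/* Evaluating SK_ERRVAR once and reusing the pointer costs one call. */
static int sk_check_cached(void)
{
    int *const errvar = SK_ERRVAR;
    return SK_IS_SMALL_POSITIVE(errvar);
}
```

Caching is only safe while nothing can replace the underlying object between uses, which is why the commit message takes the conservative view that callbacks may change it.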
* Refactor is(SPACE|PSXSP)_(uni|utf8) macros and utf8.c  Karl Williamson  2012-11-19  1  -4/+2
| | | | | | | | | | | This refactors the isSPACE_uni, is_SPACE_utf8, isPSXSPC_uni, and is_PSXSPC_utf8 macros in handy.h, so that no function call need be done to handle above Latin1 input. These macros are quite small, and unlikely to grow over time, as Unicode has mostly finished adding white space equivalents to the Standard. The functions that implement these in utf8.c are also changed to use the macros instead of generating a swash. This should speed things up slightly, with less memory used over time as the swash fills.
* Refactor is_XDIGIT_uni(), is_XDIGIT_utf8() and macros  Karl Williamson  2012-11-19  1  -4/+2
| | | | | | | | | | This adds macros to regen/regcharclass.pl that are usable as part of the is_XDIGIT_foo() macros in handy.h, so that no function call need be done to handle above Latin1 input. These macros are quite small, and unlikely to grow over time. The functions that implement these in utf8.c are also changed to use the macros instead of generating a swash. This should speed things up slightly, with less memory used over time as the swash fills.
* Refactor is_BLANK_uni() and is_BLANK_utf8() macros  Karl Williamson  2012-11-19  1  -4/+2
| | | | | | | | | | | This adds macros to regen/regcharclass.pl that are usable as part of the is_BLANK_foo() macros in handy.h, so that no function call need be done to handle above Latin1 input. These macros are quite small, and unlikely to grow over time, as Unicode has mostly finished adding white space equivalents to the Standard. The functions that implement these in utf8.c are also changed to use the macros instead of generating a swash. This should speed things up slightly, with less memory used over time as the swash fills.
* Refactor is_CNTRL_utf8(), is_utf8_cntrl()  Karl Williamson  2012-11-19  1  -9/+1
| | | | | | | All controls will always be in the Latin1 range by Unicode's stability policy. This means that we don't have to call is_utf8_cntrl() when the input to the is_CNTRL_utf8() macro is above Latin1; we can just fail. And that means that Perl_is_utf8_cntrl() can just use the macro.
* utf8.c: Request function to be inline  Karl Williamson  2012-11-19  1  -1/+1
| | | | | This could remove a layer of function call overhead for this small function, (if the compiler doesn't already choose to inline it).
* utf8.c: White-space, comments only  Karl Williamson  2012-11-19  1  -2/+3
|
* utf8.c: Fix potential bug  Karl Williamson  2012-11-19  1  -13/+28
| | | | | | | | | | | | | Commit 87367d5f9dc9bbf7db1a6cf87820cea76571bf1a changed core_invlist_init() to return not the swash, but the swash's inversion list if small enough, allowing a faster binary search than a slower hash look-up on small lists. Calls to two functions that access swashes were changed to make this transparent. However, there are two more such functions which were overlooked, and need to be upgraded to provide such transparency, should they ever be called on swashes that have been converted. This commit fixes one of them, but leaves the other, with a comment, as it's much harder to do, and will not ever likely be called on such a swash (it is for internal core use only).
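For readers unfamiliar with inversion lists, the binary search mentioned above can be sketched as follows. This uses the textbook layout (a sorted array of code points at which membership flips, with element 0 starting the first included range); Perl's internal representation may differ in its details, and `sk_invlist_contains()` and `sk_example` are hypothetical names.

```c
#include <stddef.h>

/* Returns 1 if cp is in the set described by the inversion list, else 0.
 * list[0] starts the first included range, list[1] is the first code
 * point after it, list[2] starts the second included range, and so on. */
static int sk_invlist_contains(const unsigned long *list, size_t len,
                               unsigned long cp)
{
    size_t lo = 0, hi = len;  /* search window is [lo, hi) */

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* lo elements are <= cp. An odd count means the most recent flip
     * turned membership on, so cp lies inside an included range. */
    return (lo % 2) == 1;
}

/* Example set: ASCII digits and ASCII uppercase letters, that is,
 * the half-open ranges [0x30, 0x3A) and [0x41, 0x5B). */
static const unsigned long sk_example[] = { 0x30, 0x3A, 0x41, 0x5B };
```

A lookup costs O(log n) comparisons over a compact array, which is the advantage over a hash look-up that the message refers to for small lists.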
* Stop \P{Assigned} from leaking  Father Chrysostomos  2012-11-18  1  -1/+1
| | | | | | | I suspect this leak also applies to any large character classes. An HV created with newHV has a reference count of 1, so doing newRV_inc on it will cause a leak.
* utf8.c: Fix a minor refcounting bug caused by 02c8547  Father Chrysostomos  2012-11-18  1  -0/+3
| | | | Under some circumstances it could cause a hash to point to a freed element. But the hash itself was leaking, so it caused no problems, as no attempt to free its element again was made. The next commit will stop the hash from leaking.
* Stop $unicode =~ /[[:posix:]]/ from leaking  Father Chrysostomos  2012-11-18  1  -1/+6
| | | | | | | | If we have just created an SV, it has a reference count of 1, so using newRV_inc on it will create a leak. So we need to use newRV_noinc and do SvREFCNT_inc in those cases where the SV is not new. This has leaked since v5.17.3-117-g87367d5.
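The bookkeeping behind this leak can be modelled with a toy reference counter. Nothing below is Perl API: `SkValue`, `SkRef`, `sk_ref_inc()` and `sk_ref_noinc()` are invented for this sketch and only mirror the inc/noinc distinction the message describes.

```c
#include <stddef.h>

typedef struct { int refcnt; } SkValue;       /* toy counted value */
typedef struct { SkValue *target; } SkRef;    /* toy reference to it */

/* A freshly created value starts with a count of 1, owned by its creator. */
static SkValue sk_new_value(void)
{
    SkValue v;
    v.refcnt = 1;
    return v;
}

/* Makes a reference that takes its own count on the target. */
static SkRef sk_ref_inc(SkValue *v)
{
    SkRef r;
    v->refcnt++;
    r.target = v;
    return r;
}

/* Makes a reference that adopts the count the caller already holds. */
static SkRef sk_ref_noinc(SkValue *v)
{
    SkRef r;
    r.target = v;
    return r;
}

/* Dropping a reference releases exactly one count. */
static void sk_ref_drop(SkRef *r)
{
    r->target->refcnt--;
    r->target = NULL;
}

/* New value wrapped with the "inc" form: one count is never released. */
static int sk_count_after_inc(void)
{
    SkValue v = sk_new_value();
    SkRef r = sk_ref_inc(&v);
    sk_ref_drop(&r);
    return v.refcnt;
}

/* New value wrapped with the "noinc" form: the count reaches zero. */
static int sk_count_after_noinc(void)
{
    SkValue v = sk_new_value();
    SkRef r = sk_ref_noinc(&v);
    sk_ref_drop(&r);
    return v.refcnt;
}

/* Existing value owned elsewhere: bump the count explicitly, then use
 * the "noinc" form, so the original owner's count survives the drop. */
static int sk_count_shared(void)
{
    SkValue v = sk_new_value();   /* count 1 belongs to another owner */
    SkRef r;
    v.refcnt++;
    r = sk_ref_noinc(&v);
    sk_ref_drop(&r);
    return v.refcnt;
}
```

In this model `sk_count_after_inc()` leaves a count of 1 with no owner (the leak), `sk_count_after_noinc()` leaves 0, and `sk_count_shared()` leaves the original owner's single count intact.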
* utf8.c: Don't copy a buffer to itself  Karl Williamson  2012-11-14  1  -1/+3
| | | | | | | | | memcpy(), which is what Copy() resolves to, is not supposed to handle the possibility of overlapping source and destination. In some cases in this code, the source and destination pointers are identical. What should happen then is a no-op, so just don't do the copy in that case. If the ptrs aren't identical, they won't otherwise overlap, so the Copy() is valid except for when they are identical.
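A minimal standalone version of the guard, using plain memcpy() rather than Perl's Copy() macro. `sk_copy_unless_same()` is a hypothetical helper, and it relies on the same precondition the message states: the two buffers are either identical or do not overlap at all.

```c
#include <string.h>

/* Copies len bytes from src to dst, treating dst == src as a no-op.
 * Precondition (from the commit message): if the pointers differ, the
 * two regions do not overlap, so memcpy() is valid. */
static void sk_copy_unless_same(char *dst, const char *src, size_t len)
{
    if (dst != src)
        memcpy(dst, src, len);
}
```

If partial overlap were possible, memmove() would be the appropriate call instead; the pointer comparison is enough here only because of the stated precondition.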
* utf8.c: Remove redundant unlikely branches  Karl Williamson  2012-11-11  1  -4/+0
| | | | | | | The 2 lines removed in each function provide an early exit if the input is malformed UTF-8. Other code executed later makes the same test. But most inputs are going to be well-formed, so the test will almost always fail, so will slow things up.
* utf8n_to_uvuni() pod: Add clarifications  Karl Williamson  2012-11-11  1  -1/+7
|
* Add C define to remove taint support from perl  Steffen Mueller  2012-11-05  1  -2/+4
| | | | By defining NO_TAINT_SUPPORT, all the various checks that perl does for tainting become no-ops. It's not an entirely complete change: it doesn't attempt to remove the taint-related interpreter variables, but instead virtually eliminates access to it.
| | | | Why, you ask? Because it appears to speed up perl's run-time significantly by avoiding various "are we running under taint" checks and the like.
| | | | This change is not in a state to go into blead yet. The actual way I implemented it might raise some (valid) objections. Basically, I replaced all uses of the global taint variables (but not PL_taint_warn!) with an extra layer of get/set macros (TAINT_get/TAINTING_get).
| | | | Furthermore, the change is not complete:
| | | | - PL_taint_warn would likely deserve the same treatment.
| | | | - Obviously, tests fail. We have tests for -t/-T
| | | | - Right now, I added a Perl warn() on startup when -t/-T are detected but the perl was not compiled to support it. It might be argued that it should be silently ignored! Needs some thinking.
| | | | - Code quality concerns - needs review.
| | | | - Configure support required.
| | | | - Needs thinking: How does this tie in with CPAN XS modules that use PL_taint and friends? It's easy to backport the new macros via PPPort, but that doesn't magically change all code out there. Might be harmless, though, because whenever you're running under NO_TAINT_SUPPORT, any check of PL_taint/etc is going to come up false. Thus, the only CPAN code that SHOULD be adversely affected is code that changes taint state.
* utf8.c: Stop _core_swash_init from leaking  Father Chrysostomos  2012-10-30  1  -2/+2
| | | | | If an %INC hook or $@ assignment dies, then a scalar is leaked. I don’t know that it is possible to test this.
* perlapi.pod: Clarify what a parameter means  Karl Williamson  2012-10-20  1  -3/+5
|