path: root/handy.h
Commit message | Author | Age | Files | Lines
* regex: Add pseudo-Posix class: 'cased' | Karl Williamson | 2012-12-31 | 1 | -18/+21
| /[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property \p{Cased}. This commit introduces a pseudo-Posix class, internally named 'cased', to represent this. This class isn't specifiable by the user, except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug output will say ':cased:'. When the regex parser sees either :lower: or :upper: under /i, it changes them into :cased:, which already existing logic can handle just like any other class. This commit fixes the regression introduced in 3018b823898645e44b8c37c70ac5c6302b031381, and also fixes the fact that these have never worked under 'use locale'. The next commit will un-TODO the tests for these things.
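The class substitution described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the regcomp.c code: `resolve_class()`, `in_class()`, and the `SKETCH_*` class names are hypothetical, and only ASCII input is handled.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical class identifiers for this sketch only. */
enum sketch_class { SKETCH_UPPER, SKETCH_LOWER, SKETCH_CASED, SKETCH_DIGIT };

/* Under case-insensitive matching, :upper: and :lower: both become the
 * internal 'cased' class; every other class is left unchanged. */
static enum sketch_class resolve_class(enum sketch_class cls, bool fold)
{
    if (fold && (cls == SKETCH_UPPER || cls == SKETCH_LOWER))
        return SKETCH_CASED;
    return cls;
}

/* ASCII-only membership test, standing in for the real class logic. */
static bool in_class(enum sketch_class cls, unsigned char c)
{
    switch (cls) {
    case SKETCH_UPPER: return isupper(c) != 0;
    case SKETCH_LOWER: return islower(c) != 0;
    case SKETCH_CASED: return isupper(c) || islower(c);
    case SKETCH_DIGIT: return isdigit(c) != 0;
    }
    return false;
}
```

With this shape, the matching code needs no special case for /i: it simply tests membership in whatever class `resolve_class()` returned.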
* handy.h, regcomp.h, regexec.c: Sort initializers, switch() | Karl Williamson | 2012-12-31 | 1 | -8/+8
| Until recently, these needed to be (or it made sense for them to be) ordered by the numerical value that the rhs of each #define evaluates to. But now they are all initialized to something else, and the numerical value is not even apparent. Alphabetical order gives a logical ordering to help a reader find things.
* perlapi: Clarify isSPACE(), document isPSXSPC() | Karl Williamson | 2012-12-23 | 1 | -2/+25
|
* handy.h: Add full complement of isIDCONT() macros | Karl Williamson | 2012-12-23 | 1 | -3/+12
| This also changes isIDCONT_utf8() to use the Perl definition, which excludes any \W characters (the Unicode definition includes a few of these). Tests are also added. These macros remain undocumented for now.
* Remove temporary back-compat PL_ variable names | Karl Williamson | 2012-12-22 | 1 | -10/+0
| These names are synonyms for specific array elements, and were used temporarily until all uses of them were removed. This commit removes the remaining uses and the definitions.
* handy.h: Improve some comments | Karl Williamson | 2012-12-22 | 1 | -9/+14
|
* Consolidate some regex OPS | Karl Williamson | 2012-12-22 | 1 | -1/+1
| The regular expression operation POSIXA works on any of the (currently) 16 posix classes (like \w and [:graph:]) under the regex modifier /a. This commit creates similar operations for the other modifiers: POSIXL (for /l), POSIXD (for /d), POSIXU (for /u), plus their complements. It causes these ops to be generated instead of the ALNUM, DIGIT, HORIZWS, SPACE, and VERTWS ops, as well as all their variants. The net saving is 22 regnode types.
| The reason to do this is for maintenance. As of this commit, there are now 22 fewer node types for which code has to be maintained. The code for each variant was essentially the same logic, but on different operands. It would be easy to make a change to one copy and forget to make the corresponding change in the others. Indeed, this patch fixes [perl #114272] in which one copy was out of sync with others.
| This patch actually reduces the number of separate code paths to 5: POSIXA, NPOSIXA, POSIXL, POSIXD, and POSIXU. The complements of the last 3 use the same code path as their non-complemented version, except that a variable is initialized differently. The code then XORs this variable with its result to do the complementing or not. Further, the POSIXD branch now just checks if the target string being matched is UTF-8 or not, and then jumps to either the POSIXU or POSIXA code respectively. So, there are effectively only 4 cases that are coded: POSIXA, NPOSIXA, POSIXL, and POSIXU. (POSIXA doesn't have to worry about UTF-8, while NPOSIXA does, hence these are coded separately for efficiency.)
| Removing all this code saves memory. The output of the Linux size command shows that the perl executable was shrunk by 33K bytes on my platform compiled under -O0 (.7%) and by 18K bytes (1.3%) under -O2.
| The reason this patch was doable was previous work in numbering the POSIX classes, so that they could be indexed in arrays and bit positions. This is a large patch; I didn't see how to break it into smaller components.
| I chose to make this code more efficient as opposed to saving even more memory. Thus there is a separate loop that is jumped to after we know we have to load a swash; this just saves having to test if the swash is loaded each time through the loop. I avoid loading the swash until absolutely necessary. In places in the previous version of this code, the swash was loaded when the input was UTF-8, even if it wasn't yet needed (and might never be if the input didn't contain anything above Latin1), apparently to avoid the extra test per iteration.
| The Perl test suite runs slightly faster on my platform with this patch under -O0, and the speeds are indistinguishable under -O2. This is in spite of these new POSIX regops being unknown to the regex optimizer (this will be addressed in future commits), and extra machine instructions being required for each character (the xor, and some shifting and masking). I expect this is a result of better caching, and not loading swashes unless absolutely necessary.
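The XOR-based complementing mentioned in the message above can be illustrated with a small sketch. It is not the regexec.c implementation: `class_matches()`, `posix_match()`, and the two-member enum are hypothetical names, and only ASCII input is considered.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical class numbers; the real ones are defined in handy.h. */
enum sketch_class { SKETCH_DIGIT, SKETCH_SPACE };

/* Shared membership test for a class (ASCII only in this sketch). */
static bool class_matches(enum sketch_class cls, unsigned char c)
{
    switch (cls) {
    case SKETCH_DIGIT: return isdigit(c) != 0;
    case SKETCH_SPACE: return isspace(c) != 0;
    }
    return false;
}

/* One code path serves both an op and its complement: to_complement is
 * false for the plain op and true for the negated one, and is XORed
 * with the raw result. */
static bool posix_match(enum sketch_class cls, bool to_complement,
                        unsigned char c)
{
    return to_complement ^ class_matches(cls, c);
}
```

The cost is one extra XOR per character, which is the trade-off the message mentions; the gain is that the negated op needs no code of its own.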
* handy.h: Refactor some internal macro calls | Karl Williamson | 2012-12-22 | 1 | -73/+78
| I didn't plan very well when I added these macros recently. This refactors them to be more logical.
* Use array for some inversion lists | Karl Williamson | 2012-12-22 | 1 | -0/+1
| This patch creates an array pointing to the inversion lists that cover the Latin-1 ranges for Posix character classes, and uses it instead of the individual variables previously referred to.
* regcomp.c: Use table look-up instead of individual strings. | Karl Williamson | 2012-12-22 | 1 | -1/+5
| This changes the code to get the name for the character class's Unicode property via table lookup. This is in preparation for making most of the cases in this switch identical, so they can be collapsed.
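A table lookup of this kind might look like the sketch below. The enum, the `property_names` array, and `property_name()` are hypothetical and cover only three classes; the strings are the Perl property names for those classes, but the real table's layout and contents are not shown in this log.

```c
#include <stddef.h>

/* Hypothetical class numbers for this sketch. */
enum { SKETCH_ALPHA, SKETCH_DIGIT, SKETCH_SPACE, SKETCH_CLASS_COUNT };

/* One string per class, indexed by class number, instead of one
 * hard-coded string per switch case. */
static const char *const property_names[SKETCH_CLASS_COUNT] = {
    [SKETCH_ALPHA] = "XPosixAlpha",
    [SKETCH_DIGIT] = "XPosixDigit",
    [SKETCH_SPACE] = "XPosixSpace",
};

/* Returns the property name for a class number, or NULL if out of range. */
static const char *property_name(int cls)
{
    if (cls < 0 || cls >= SKETCH_CLASS_COUNT)
        return NULL;
    return property_names[cls];
}
```

Once every case obtains its name this way, the case bodies differ only in the class number, which is what allows them to be collapsed.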
* handy.h: Move some back compat macros | Karl Williamson | 2012-12-22 | 1 | -9/+7
| Move them to the section that is for back-compat definitions.
* Add generic _is_(uni|utf8)_FOO() function | Karl Williamson | 2012-12-22 | 1 | -18/+45
| This function uses table lookup to replace 9 more specific functions, which can be deprecated. They should not have been exposed to the public API in the first place.
* handy.h: Create isALPHANUMERIC() and kin | Karl Williamson | 2012-12-22 | 1 | -17/+38
| Perl has had an undocumented macro isALNUMC() for a long time. I want to document it, but the name is very obscure. Neither Yves nor I is sure what it stands for. My best guess is "C's alnum". It corresponds to /[[:alnum:]]/, and so its best name would be isALNUM(). But that is the name long given to what matches \w. A new synonym, isWORDCHAR(), has been in place for several releases for that, but the old isALNUM() should remain for backwards compatibility. I don't think that the name isALNUMC() should be published, as it is too close to isALNUM(). I finally came to the conclusion that isALPHANUMERIC() is the best name; it describes its purpose clearly; the disadvantage is its long length. I doubt that it will get much use, but we need something, I think, that we can publish to accomplish this functionality. This commit also converts core uses of isALNUMC to isALPHANUMERIC. (I intended to do that separately, but made a mistake in rebasing, and combined the two patches; and it seemed like not a big enough problem to separate them out again.)
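The distinction between the two names comes down to the underscore, which \w matches and [[:alnum:]] does not. A minimal ASCII-only sketch, using hypothetical stand-in functions rather than the handy.h macros:

```c
#include <ctype.h>
#include <stdbool.h>

/* Stand-in for the [[:alnum:]] meaning (isALPHANUMERIC), ASCII only. */
static bool sketch_is_alphanumeric(unsigned char c)
{
    return c < 128 && isalnum(c);
}

/* Stand-in for the \w meaning (isWORDCHAR): alphanumerics plus '_'. */
static bool sketch_is_wordchar(unsigned char c)
{
    return sketch_is_alphanumeric(c) || c == '_';
}
```

For example, `sketch_is_wordchar('_')` is true while `sketch_is_alphanumeric('_')` is false; for letters and digits the two agree.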
* handy.h: Move some #defines | Karl Williamson | 2012-12-22 | 1 | -10/+10
| I'm moving this block of back-compat macros to later in the file, so it comes after all the other definitions that may need to have backwards compatibility equivalents.
* intrpvar.h: Place some swash pointers in an array | Karl Williamson | 2012-12-22 | 1 | -0/+12
|
* handy.h: Guard against recursive #inclusion | Karl Williamson | 2012-12-22 | 1 | -0/+5
|
* regexec.c: Replace infamous if-else-if sequence by loop | Karl Williamson | 2012-12-09 | 1 | -1/+2
| This saves 1.5 KB in the text section of regexec.o on my machine (unoptimized) and 820 bytes optimized. I did not benchmark, as we don't really care very much about performance under 'use locale'.
* handy.h: Add an enum typedef | Karl Williamson | 2012-12-09 | 1 | -0/+23
| This creates a copy of all the Posix character class numbers and puts them in an enum. This enum is for internal Perl core use only, and is used in the hope that compilers can generate better code from future commits that will make use of it.
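The idea of mirroring numeric #defines in an enum can be sketched as below. All names here are hypothetical; the actual #define and enum names in handy.h are not given in this log.

```c
/* Hypothetical class numbers, standing in for the #defines in handy.h. */
#define SKETCH_CC_ALPHA 0
#define SKETCH_CC_DIGIT 1
#define SKETCH_CC_SPACE 2

/* The enum copies the #define values, so the two stay interchangeable. */
typedef enum {
    SKETCH_ENUM_ALPHA = SKETCH_CC_ALPHA,
    SKETCH_ENUM_DIGIT = SKETCH_CC_DIGIT,
    SKETCH_ENUM_SPACE = SKETCH_CC_SPACE
} sketch_char_class;

/* A switch over the enum type lets a compiler see the full set of cases. */
static const char *class_name(sketch_char_class cls)
{
    switch (cls) {
    case SKETCH_ENUM_ALPHA: return "alpha";
    case SKETCH_ENUM_DIGIT: return "digit";
    case SKETCH_ENUM_SPACE: return "space";
    }
    return "unknown";
}
```

Because the switch operand has an enum type, many compilers can warn about unhandled enumerators, and may generate tighter dispatch code than for an arbitrary integer.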
* handy.h: Reorder char class #defines; add comments | Karl Williamson | 2012-12-09 | 1 | -18/+35
| This sorts the Posix-like classes into two groups: one containing those classes whose above-Latin1 lookups are done via swashes, the other containing those whose lookups aren't. This will prove useful in future commits.
* handy.h: Add comment | Karl Williamson | 2012-12-09 | 1 | -0/+8
|
* handy.h: Improve isDIGIT_utf8() and isXDIGIT_utf8() macros | Karl Williamson | 2012-12-09 | 1 | -2/+14
| There are no digits in the upper Latin1 range, therefore we can skip testing for them.
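A rough sketch of this shortcut, assuming an ASCII (non-EBCDIC) platform and well-formed UTF-8 input. The function names are hypothetical, and `slow_is_digit_utf8()` is only a stand-in for the real above-Latin1 lookup: it recognizes just the Arabic-Indic digits.

```c
#include <stdbool.h>

/* Stand-in for the slow above-Latin1 lookup; it only knows the
 * Arabic-Indic digits U+0660..U+0669 (UTF-8: D9 A0 .. D9 A9). */
static bool slow_is_digit_utf8(const unsigned char *p)
{
    return p[0] == 0xD9 && p[1] >= 0xA0 && p[1] <= 0xA9;
}

static bool sketch_is_digit_utf8(const unsigned char *p)
{
    if (p[0] < 0x80)                    /* ASCII: one byte, test directly */
        return p[0] >= '0' && p[0] <= '9';
    if (p[0] == 0xC2 || p[0] == 0xC3)   /* U+0080..U+00FF: no digits there */
        return false;
    return slow_is_digit_utf8(p);       /* above Latin1 */
}
```

The middle branch is the saving: a character whose lead byte is 0xC2 or 0xC3 is rejected without decoding it or consulting any table.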
* handy.h: Change documentation for perlapi | Karl Williamson | 2012-12-09 | 1 | -41/+148
| This documents several more of the character classification macros, including all variants of them. There are no code changes. The READ_XDIGIT macro was moved to "Miscellaneous Functions", as it really isn't character classification. Several of the macros remain undocumented because I'm not comfortable yet about their names and/or functionality.
* handy.h: Sort macros in groups alphabetically | Karl Williamson | 2012-12-09 | 1 | -99/+91
| This should make it easier to find things. No code changes, but there are some comment changes.
* handy.h: Make clear that macro is only true in ASCII range | Karl Williamson | 2012-12-09 | 1 | -1/+1
| I don't believe there are platforms that this is wrong on, but using the _A suffix clearly indicates that only ASCII-range characters can match this, like its cohort macros that surround it.
* handy.h: Fix isBLANK_LC_uni() | Karl Williamson | 2012-12-09 | 1 | -1/+1
| This macro can be defined in terms of the foo_uvchr() version. It should be correct on platforms that have an isblank() function in the C library. I don't know why this macro exists. It doesn't correspond to any of the other ones (though a recent commit removed one it did correspond to, but which can't have been in use because it expanded to a non-existent function). I'm leaving it in just for back compat. I did not add tests for this macro.
* handy.h: White space only | Karl Williamson | 2012-12-09 | 1 | -27/+43
| This makes things line up in columns and not exceed 80 columns in width.
* handy.h: Fix up Posix Space macros | Karl Williamson | 2012-12-09 | 1 | -5/+7
| Under the default Posix locale, \s and [[:space:]] are the same, so there is no need to try to make sure that [[:space:]] matches a vertical tab -- it already does. Also, one of the macros had a typo, trying to add a form feed instead of a vertical tab.
* Add functions for getting ctype ALNUMC | Karl Williamson | 2012-12-09 | 1 | -0/+3
| We think this is meant to stand for C's alphanumeric, that is, what is matched by POSIX [:alnum:]. There were no functions or dedicated swash available for accessing it. Future commits will want to use these.
* handy.h: Add some missing macros | Karl Williamson | 2012-12-09 | 1 | -0/+11
| Not all character classifications had macros. This commit adds all but one of the missing ones. The remaining one will be added in a separate commit.
* handy.h: Add synonym for some macros | Karl Williamson | 2012-12-09 | 1 | -6/+11
| For some time, WORDCHAR has been preferred to ALNUM because of the nearly identical ALNUMC, which means something else (the C language definition of alnum). This adds macros for WORDCHAR, while retaining ALNUM for backwards compatibility. Also, another macro is redefined using WORDCHAR in preference to ALNUM.
* handy.h: Make some macros more time efficient | Karl Williamson | 2012-12-09 | 1 | -13/+29
| These macros check if a UTF-8 encoded character is of particular types for use with locales. Prior to this patch, they called a function to convert the character to a code point value. This was used as input to another macro that handles code points. For values above the Latin1 range, that macro calls a function, which converts back to UTF-8 and calls another function. This commit changes that to call the UTF-8 function directly for above-Latin1 code points. No conversion need be done. For Latin1 code points, it converts, if necessary, to the code point and calls a macro that handles these directly. Some of these macros now use a macro instead of a function call for above-Latin1 code points, as is done in various other places in this file.
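The dispatch described above might be sketched like this, assuming an ASCII platform and well-formed UTF-8. The names are hypothetical, and `utf8_is_alpha_above_latin1()` is a stand-in that recognizes a single character; the point is only that the above-Latin1 branch works on the UTF-8 bytes directly, with no round trip through a code point.

```c
#include <ctype.h>
#include <stdbool.h>

/* Stand-in for a function that classifies above-Latin1 UTF-8 directly;
 * this sketch only knows GREEK SMALL LETTER ALPHA, U+03B1 (CE B1). */
static bool utf8_is_alpha_above_latin1(const unsigned char *p)
{
    return p[0] == 0xCE && p[1] == 0xB1;
}

static bool sketch_is_alpha_lc_utf8(const unsigned char *p)
{
    if (p[0] < 0x80)                      /* ASCII: byte is the code point */
        return isalpha(p[0]) != 0;
    if (p[0] == 0xC2 || p[0] == 0xC3) {   /* two bytes encode U+0080..U+00FF */
        unsigned char cp =
            (unsigned char) (((p[0] & 0x03) << 6) | (p[1] & 0x3F));
        return isalpha(cp) != 0;          /* the current locale decides */
    }
    return utf8_is_alpha_above_latin1(p); /* no conversion needed */
}
```

Only the Latin1 branch pays for a conversion, and that conversion is two masks and a shift rather than a function call.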
* handy.h: Avoid function calls in 2 macros | Karl Williamson | 2012-12-09 | 1 | -2/+2
| There is a macro that returns the same as the function call previously used in the SPACE macro; and nothing above Latin1 can possibly match the CNTRL macro.
* handy.h: Define some macros using a base macro | Karl Williamson | 2012-12-09 | 1 | -11/+16
| This allows the common parts of the definitions of these to all use the same logic.
* handy.h: Define some locale macros for all inputs | Karl Williamson | 2012-12-09 | 1 | -23/+23
| Prior to this commit, if you called these macros with something outside the Latin1 range, the return value was not defined, subject to the whims of your particular C compiler and library. This commit causes all the boolean macros to return FALSE for non-Latin1 inputs, and all the map macros to return their inputs.
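The guarded behaviour can be sketched as follows. The function names are hypothetical stand-ins for the locale macros; the range check is what keeps out-of-range values away from the C library's ctype functions, whose behaviour is undefined for such arguments.

```c
#include <ctype.h>
#include <stdbool.h>

/* Classification: anything outside 0..255 is simply not in the class. */
static bool sketch_is_upper_lc(unsigned long cp)
{
    return cp < 256 && isupper((unsigned char) cp);
}

/* Mapping: anything outside 0..255 is returned unchanged. */
static unsigned long sketch_to_lower_lc(unsigned long cp)
{
    return cp < 256 ? (unsigned long) tolower((unsigned char) cp) : cp;
}
```

For instance, `sketch_is_upper_lc(0x100)` is false and `sketch_to_lower_lc(0x1E00)` returns 0x1E00, regardless of the C library in use.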
* handy.h: Remove unused macro | Karl Williamson | 2012-12-09 | 1 | -1/+0
| This macro expands to a function or macro call that does not exist, so this macro itself can't be in use by anyone. It also doesn't fit the paradigm of the other macros above it, being defined in terms of uni instead of uvchr; nor does it really gain anything, since \s is a posix space under locale. The \f also appears to be a typo; based on the commit message, it should have been \v.
* handy.h: Change EBCDIC isSPACE() to include \v | Karl Williamson | 2012-12-09 | 1 | -1/+2
| This was missed in commit 075b9d7d9a6d4473b240a047655e507c8baa6db3.
* Make isIDFIRST_uni() return identically as isIDFIRST_utf8() | Karl Williamson | 2012-11-29 | 1 | -5/+4
| These two macros should have the same results for the same input code points. Prior to this patch, the _uni() macro returned the official Unicode ID_Start property, and the _utf8() macro returned Perl's slightly restricted definition. Now both return Perl's.
* Remove double underscore in internal function name | Karl Williamson | 2012-11-29 | 1 | -1/+1
| This function was added in 5.16, and has no callers in CPAN. It is undocumented and marked as changeable. Its name has two underscores in a row by mistake. This removes one of them.
* Refactor is(SPACE|PSXSP)_(uni|utf8) macros and utf8.c | Karl Williamson | 2012-11-19 | 1 | -4/+5
| This refactors the isSPACE_uni, is_SPACE_utf8, isPSXSPC_uni, and is_PSXSPC_utf8 macros in handy.h, so that no function call need be done to handle above Latin1 input. These macros are quite small, and unlikely to grow over time, as Unicode has mostly finished adding white space equivalents to the Standard. The functions that implement these in utf8.c are also changed to use the macros instead of generating a swash. This should speed things up slightly, with less memory used over time as the swash fills.
* Refactor is_XDIGIT_uni(), is_XDIGIT_utf8() and macros | Karl Williamson | 2012-11-19 | 1 | -2/+2
| This adds macros to regen/regcharclass.pl that are usable as part of the is_XDIGIT_foo() macros in handy.h, so that no function call need be done to handle above Latin1 input. These macros are quite small, and unlikely to grow over time. The functions that implement these in utf8.c are also changed to use the macros instead of generating a swash. This should speed things up slightly, with less memory used over time as the swash fills.
* Refactor is_BLANK_uni() and is_BLANK_utf8() macros | Karl Williamson | 2012-11-19 | 1 | -2/+2
| This adds macros to regen/regcharclass.pl that are usable as part of the is_BLANK_foo() macros in handy.h, so that no function call need be done to handle above Latin1 input. These macros are quite small, and unlikely to grow over time, as Unicode has mostly finished adding white space equivalents to the Standard. The functions that implement these in utf8.c are also changed to use the macros instead of generating a swash. This should speed things up slightly, with less memory used over time as the swash fills.
* handy.h: Add isVERTWS_uni(), isVERTWS_utf8() | Karl Williamson | 2012-11-19 | 1 | -0/+2
| These two macros match the same things as \v does in patterns. I'm leaving them undocumented for now.
* Refactor is_CNTRL_utf8(), is_utf8_cntrl() | Karl Williamson | 2012-11-19 | 1 | -1/+1
| All controls will always be in the Latin1 range by Unicode's stability policy. This means that we don't have to call is_utf8_cntrl() when the input to the is_CNTRL_utf8() macro is above Latin1; we can just fail. And that means that Perl_is_utf8_cntrl() can just use the macro.
* handy.h: Refactor macros to avoid aTHX_ problems | Karl Williamson | 2012-11-19 | 1 | -23/+28
| This refactors these macros so that other macros automatically add aTHX_ if necessary.
* regexes: Add \v to table of latin1 char classes | Karl Williamson | 2012-11-19 | 1 | -9/+10
| This will be used in future commits to allow \v and \V to be treated consistently with other character classes. (Doing the same for \h isn't necessary, as it matches identically to [:blank:] in the entire Unicode range.)
* handy.h: white-space, comments only | Karl Williamson | 2012-11-19 | 1 | -2/+8
|
* handy.h: Mark more clearly things for internal-only use | Karl Williamson | 2012-11-19 | 1 | -50/+47
| This changes the names of some macros to begin with an underscore, and changes comments to more clearly indicate things which aren't to be used outside the Perl core.
* handy.h: Use class numbers instead of macros in macro generation | Karl Williamson | 2012-11-19 | 1 | -41/+44
| This refactors the macro builder macros to accept a class number instead of a macro name. This is easier to understand than having to use CAT2(), and it allows for a potential future commit to use these at run-time, given a class number.
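A sketch of a builder macro that takes a class number rather than a pasted-together macro name. Everything here is hypothetical and ASCII-only: the `SK_*` names, the bit layout, and `sk_class_bits()`, which stands in for whatever per-character table the real macros consult.

```c
#include <stdbool.h>

/* Hypothetical class numbers and the bit mask derived from them. */
#define SK_CC_DIGIT 0
#define SK_CC_SPACE 1
#define SK_CC_MASK(classnum) (1u << (classnum))

/* Returns the class bits for an ASCII character (tiny stand-in table). */
static unsigned sk_class_bits(unsigned char c)
{
    unsigned bits = 0;
    if (c >= '0' && c <= '9')
        bits |= SK_CC_MASK(SK_CC_DIGIT);
    if (c == ' ' || (c >= '\t' && c <= '\r'))
        bits |= SK_CC_MASK(SK_CC_SPACE);
    return bits;
}

/* The builder takes a class number, so no token pasting is needed. */
#define SK_GENERIC_IS_CC(c, classnum) \
    ((sk_class_bits((unsigned char) (c)) & SK_CC_MASK(classnum)) != 0)

#define sk_isDIGIT(c) SK_GENERIC_IS_CC(c, SK_CC_DIGIT)
#define sk_isSPACE(c) SK_GENERIC_IS_CC(c, SK_CC_SPACE)

/* Because the class is just a number, it can also be a run-time value. */
static bool sk_is_class(unsigned char c, int classnum)
{
    return SK_GENERIC_IS_CC(c, classnum);
}
```

A name-pasting builder could only be instantiated at compile time with a literal class name; the numeric form supports both the named wrappers and a call such as `sk_is_class(c, classnum)` where `classnum` is computed at run time.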
* handy.h: Convert 3 macros to standard form | Karl Williamson | 2012-11-19 | 1 | -13/+6
| These three outliers don't have to be. They can use the same constructed form as the others surrounding them. One requires a temporary #define which will be removed in a future commit.
* handy.h: Define some macros for consistency | Karl Williamson | 2012-11-19 | 1 | -2/+4
| isWORDCHAR_uni() and isWORDCHAR_utf8() are defined for consistency with the other isWORDCHAR...() macros. The isALNUM_foo() versions are retained for backwards compatibility.