| Commit message (Collapse) | Author | Age | Files | Lines |
| |
/[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property
\p{Cased}. This commit introduces a pseudo-Posix class, internally named
'cased', to represent this. This class isn't specifiable by the user,
except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug
output will say ':cased:'.
The regex parsing either of :lower: or :upper: will change them into
:cased:, where already existing logic can handle this, just like any
other class.
This commit fixes the regression introduced in
3018b823898645e44b8c37c70ac5c6302b031381, and also fixes the fact that
these have never worked under 'use locale'. The next commit will un-TODO
the tests for these things.
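The remapping of :lower: and :upper: to 'cased' can be pictured with a minimal C sketch. All names here (sk_class, SK_CASED, sk_fold_class) are hypothetical and are not taken from the Perl source; the sketch only assumes that the parser rewrites the class when /i is in effect.

```c
#include <stdbool.h>

/* Hypothetical class identifiers, for illustration only. */
typedef enum { SK_ALPHA, SK_LOWER, SK_UPPER, SK_CASED } sk_class;

/* Under case-insensitive matching, :lower: and :upper: both become the
 * internal 'cased' class; every other class is left unchanged. */
static sk_class sk_fold_class(sk_class cls, bool fold)
{
    if (fold && (cls == SK_LOWER || cls == SK_UPPER))
        return SK_CASED;
    return cls;
}
```

After this step, code further down only ever sees an ordinary class number, which is why no special handling is needed there.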
|
|
|
|
|
|
|
|
| |
Until recently, these needed to be (or it made sense for them to be) in
order of the numerical value that the rhs of each #define evaluates to.
But now they are all initialized to something else, and the numerical
value is not even apparent. Alphabetical order gives a logical ordering
to help a reader find things.
|
| |
|
|
|
|
|
|
|
| |
This also changes isIDCONT_utf8() to use the Perl definition, which
excludes any \W characters (the Unicode definition includes a few of
these). Tests are also added. These macros remain undocumented for
now.
|
|
|
|
|
|
| |
These names are synonyms for specific array elements, and were used
temporarily until all uses of them were removed. This commit removes
the remaining uses, and the definitions.
|
| |
| |
The regular expression operation POSIXA works on any of the (currently)
16 posix classes (like \w and [:graph:]) under the regex modifier /a.
This commit creates similar operations for the other modifiers: POSIXL
(for /l), POSIXD (for /d), POSIXU (for /u), plus their complements.
It causes these ops to be generated instead of the ALNUM, DIGIT,
HORIZWS, SPACE, and VERTWS ops, as well as all their variants. The net
saving is 22 regnode types.
The reason to do this is for maintenance. As of this commit, there are
now 22 fewer node types for which code has to be maintained. The code
for each variant was essentially the same logic, but on different
operands. It would be easy to make a change to one copy and forget to
make the corresponding change in the others. Indeed, this patch fixes
[perl #114272] in which one copy was out of sync with others.
This patch actually reduces the number of separate code paths to 5:
POSIXA, NPOSIXA, POSIXL, POSIXD, and POSIXU. The complements of the
last 3 use the same code path as their non-complemented version, except
that a variable is initialized differently. The code then XORs this
variable with its result to do the complementing or not. Further, the
POSIXD branch now just checks if the target string being matched is
UTF-8 or not, and then jumps to either the POSIXU or POSIXA code
respectively. So, there are effectively only 4 cases that are coded:
POSIXA, NPOSIXA, POSIXL, and POSIXU. (POSIXA doesn't have to worry
about UTF-8, while NPOSIXA does, hence these for efficiency are coded
separately.)
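A minimal sketch of the XOR idea, assuming an ASCII platform and using hypothetical names (sk_class, sk_in_class, sk_match) that do not appear in the Perl source. One function serves both a class and its complement; the only difference is how the flag is initialized.

```c
#include <stdbool.h>

/* Hypothetical class numbers, for illustration only. */
typedef enum { SK_DIGIT, SK_SPACE, SK_WORDCHAR } sk_class;

/* ASCII-only membership test (assumes an ASCII-based character set). */
static bool sk_in_class(sk_class cls, unsigned char c)
{
    switch (cls) {
    case SK_DIGIT:
        return c >= '0' && c <= '9';
    case SK_SPACE:
        return c == ' ' || (c >= '\t' && c <= '\r');
    case SK_WORDCHAR:
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z') || c == '_';
    }
    return false;
}

/* to_complement is false for the plain op and true for its complement;
 * XORing it with the raw result yields the final answer either way. */
static bool sk_match(sk_class cls, bool to_complement, unsigned char c)
{
    return (bool)(to_complement ^ sk_in_class(cls, c));
}
```

The cost is one extra XOR per character, in exchange for not duplicating the matching logic for each complemented op.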
Removing all this code saves memory. The output of the Linux size
command shows that the perl executable was shrunk by 33K bytes on my
platform compiled under -O0 (.7%) and by 18K bytes (1.3%) under -O2.
The reason this patch was doable was previous work in numbering the
POSIX classes, so that they could be indexed in arrays and bit
positions. This is a large patch; I didn't see how to break it into
smaller components.
I chose to make this code more efficient as opposed to saving even more
memory. Thus there is a separate loop that is jumped to after we know
we have to load a swash; this just saves having to test if the swash is
loaded each time through the loop. I avoid loading the swash until
absolutely necessary. In places in the previous version of this code,
the swash was loaded when the input was UTF-8, even if it wasn't yet
needed (and might never be if the input didn't contain anything above
Latin1), apparently to avoid the extra test per iteration.
The Perl test suite runs slightly faster on my platform with this patch
under -O0, and the speeds are indistinguishable under -O2. This is in
spite of these new POSIX regops being unknown to the regex optimizer
(this will be addressed in future commits), and extra machine
instructions being required for each character (the xor, and some
shifting and masking). I expect this is a result of better caching, and
not loading swashes unless absolutely necessary.
|
|
|
|
|
| |
I didn't plan very well when I added these macros recently. This
refactors them to be more logical.
|
|
|
|
|
|
| |
This patch creates an array pointing to the inversion lists that cover
the Latin-1 ranges for Posix character classes, and uses it instead of
the individual variables previously referred to.
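As a rough illustration of replacing per-class variables with one array indexed by class number, here is a sketch. The type and names (sk_invlist, sk_latin1_lists, sk_invlist_contains) are hypothetical, and the layout is a simplified stand-in for an inversion list (a sorted array of code points at which membership toggles), not Perl's actual representation.

```c
#include <stdbool.h>
#include <stddef.h>

enum { SK_DIGIT, SK_UPPER, SK_CLASS_COUNT };

/* Simplified inversion list: even indices start a range, odd indices
 * give the first code point after it. */
typedef struct {
    const unsigned *toggles;
    size_t len;
} sk_invlist;

static const unsigned sk_digit_toggles[] = { 0x30, 0x3A };
/* A-Z, then U+00C0..U+00D6 and U+00D8..U+00DE (U+00D7 is a math sign). */
static const unsigned sk_upper_toggles[] = { 0x41, 0x5B, 0xC0, 0xD7, 0xD8, 0xDF };

/* One array indexed by class number instead of one variable per class. */
static const sk_invlist sk_latin1_lists[SK_CLASS_COUNT] = {
    [SK_DIGIT] = { sk_digit_toggles, 2 },
    [SK_UPPER] = { sk_upper_toggles, 6 },
};

/* Binary search for how many toggle points are <= cp; an odd count
 * means cp falls inside a range. */
static bool sk_invlist_contains(const sk_invlist *list, unsigned cp)
{
    size_t lo = 0, hi = list->len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list->toggles[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1u) != 0;
}
```

A caller would write something like sk_invlist_contains(&sk_latin1_lists[SK_UPPER], 0xC9), so code that handles "any class" needs no per-class branches.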
|
|
|
|
|
|
| |
This changes the code to get the name for the character class's Unicode
property via table lookup. This is in preparation for making most of
the cases in this switch identical, so they can be collapsed.
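A sketch of what a name lookup by class number might look like. The enum, the table, and sk_property_name are hypothetical; the strings are Perl property names used here only as sample data, and the real table's contents are not given in this message.

```c
#include <stddef.h>

/* Hypothetical class numbers, for illustration only. */
enum { SK_ALPHA, SK_DIGIT, SK_SPACE, SK_CLASS_COUNT };

/* Property names indexed by class number (sample data). */
static const char *const sk_property_names[SK_CLASS_COUNT] = {
    [SK_ALPHA] = "XPosixAlpha",
    [SK_DIGIT] = "XPosixDigit",
    [SK_SPACE] = "XPosixSpace",
};

/* Returns the property name for a class, or NULL if out of range. */
static const char *sk_property_name(unsigned classnum)
{
    return classnum < SK_CLASS_COUNT ? sk_property_names[classnum] : NULL;
}
```

With the name coming from a table, switch cases that differed only in a string literal become textually identical and can share one body.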
|
|
|
|
| |
Move them to the section that is for back-compat definitions.
|
|
|
|
|
|
| |
This function uses table lookup to replace 9 more specific functions,
which can be deprecated. They should not have been exposed to the
public API in the first place.
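The general shape of one table-driven function standing in for several specific ones might look like the following sketch. It covers ASCII only, assumes an ASCII-based character set, and uses hypothetical names (sk_bits, sk_is_generic); the real function's name and table are not shown in this message.

```c
#include <stdbool.h>

/* Hypothetical class numbers; each is also a bit position. */
enum { SK_CC_DIGIT, SK_CC_UPPER, SK_CC_LOWER, SK_CC_COUNT };

static unsigned char sk_bits[128];   /* one bit per class, per character */
static bool sk_ready = false;

/* Fill the table once; relies on contiguous letters and digits. */
static void sk_init(void)
{
    unsigned c;
    for (c = '0'; c <= '9'; c++) sk_bits[c] |= (unsigned char)(1u << SK_CC_DIGIT);
    for (c = 'A'; c <= 'Z'; c++) sk_bits[c] |= (unsigned char)(1u << SK_CC_UPPER);
    for (c = 'a'; c <= 'z'; c++) sk_bits[c] |= (unsigned char)(1u << SK_CC_LOWER);
    sk_ready = true;
}

/* A single entry point: the class is data, not part of the name. */
static bool sk_is_generic(unsigned classnum, unsigned cp)
{
    if (!sk_ready)
        sk_init();
    if (classnum >= SK_CC_COUNT || cp >= 128)
        return false;
    return ((sk_bits[cp] >> classnum) & 1u) != 0;
}
```

Adding a class then means adding a table column rather than another exported function.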
|
| |
Perl has had an undocumented macro isALNUMC() for a long time. I want
to document it, but the name is very obscure. Neither Yves nor I is
sure what it stands for. My best guess is "C's alnum". It corresponds to
/[[:alnum:]]/, and so its best name would be isALNUM(). But that is the
name long given to what matches \w. A new synonym, isWORDCHAR(), has
been in place for several releases for that, but the old isALNUM()
should remain for backwards compatibility.
I don't think that the name isALNUMC() should be published, as it is too
close to isALNUM(). I finally came to the conclusion that
isALPHANUMERIC() is the best name; it describes its purpose clearly; the
disadvantage is its long length. I doubt that it will get much use, but
we need something, I think, that we can publish to accomplish this
functionality.
This commit also converts core uses of isALNUMC to isALPHANUMERIC. (I
intended to do that separately, but made a mistake in rebasing and
combined the two patches, and it did not seem like a big enough problem
to separate them out again.)
|
|
|
|
|
|
| |
I'm moving this block of back-compat macros to later in the file, so
it comes after all the other definitions that may need to have backwards
compatibility equivalents.
|
| |
|
| |
|
|
|
|
|
|
| |
This saves 1.5 KB in the text section of regexec.o on my machine when
unoptimized, and 820 bytes when optimized. I did not benchmark, as we
don't really care very much about performance under 'use locale'.
|
|
|
|
|
|
|
| |
This creates a copy of all the Posix character class numbers and puts
them in an enum. This enum is for internal Perl core use only, and is
used so hopefully compilers can generate better code from future commits
that will make use of it.
|
|
|
|
|
|
| |
This divides the Posix-like classes into two groups: those whose
above-Latin1 lookups are done via swashes, and those whose lookups
aren't. This will prove useful in future commits.
|
| |
|
|
|
|
|
| |
There are no digits in the upper Latin1 range, therefore we can skip
testing for such.
|
|
|
|
|
|
|
|
|
|
|
| |
This documents several more of the character classification macros,
including all variants of them. There are no code changes.
The READ_XDIGIT macro was moved to "Miscellaneous Functions", as it
really isn't character classification.
Several of the macros remain undocumented because I'm not comfortable
yet about their names and/or functionality.
|
|
|
|
|
| |
This should make it easier to find things. No code changes, but there
are some comment changes.
|
|
|
|
|
|
| |
I don't believe there are platforms that this is wrong on, but using the
_A suffix clearly indicates that only ASCII-range characters can match
this, like its cohort macros that surround it.
|
|
|
|
|
|
|
|
|
|
| |
This macro can be defined in terms of the foo_uvchr() version. It
should be correct on platforms that have an isblank() function in the C
library. I don't know why this macro exists. It doesn't correspond to
any of the other ones (though a recent commit removed one it did
correspond to, but which can't have been in use because it expanded to a
non-existent function). I'm leaving it in just for back compat. I did
not add tests for this macro.
|
|
|
|
| |
This makes things line up in columns and not exceed 80 columns in width.
|
|
|
|
|
|
|
| |
Under the default Posix locale, \s and [[:space:]] are the same, so
there is no need to try to make sure that [[:space:]] matches a vertical
tab -- it already does. Also one of the macros had a typo, trying to
add a form feed instead of a vertical tab.
|
|
|
|
|
|
|
| |
We think this is meant to stand for C's alphanumeric, that is, what is
matched by POSIX [:alnum:]. There were no functions or dedicated swash
available for accessing it. Future commits will want to use these.
|
|
|
|
|
| |
Not all character classifications had macros. This commit adds all but
one of the missing ones. The remaining one will be added in a separate
commit.
|
|
|
|
|
|
|
|
|
| |
For some time, WORDCHAR has been preferred to ALNUM because of the
nearly identical ALNUMC which means something else (the C language
definition of alnum). This adds macros for WORDCHAR, while retaining
ALNUM for backwards compatibility.
Also, another macro is redefined using WORDCHAR in preference to ALNUM
|
| |
These macros check if a UTF-8 encoded character is of particular types
for use with locales. Prior to this patch, they called a function to
convert the character to a code point value. This was used as input to
another macro that handles code points. For values above the Latin1
range, that macro calls a function, which converts back to UTF-8 and
calls another function.
This commit changes that to call the UTF-8 function directly for
above-Latin1 code points. No conversion need be done. For Latin1 code
points, it converts, if necessary, to the code point and calls a macro
that handles these directly.
Some of these macros now use a macro instead of a function call for
above-Latin1 code points, as is done in various other places in this
file.
|
|
|
|
|
|
| |
There is a macro that returns the same as the function call previously
used in the SPACE macro; and nothing above Latin1 can possibly match the
CNTRL macro.
|
|
|
|
|
| |
This allows the common parts of the definitions of these to all use the
same logic.
|
|
|
|
|
|
|
|
| |
Prior to this commit, if you called these macros with something outside
the Latin1 range, the return value was not defined, subject to the whims
of your particular C compiler and library. This commit causes all the
boolean macros to return FALSE for non-Latin1 inputs; and all the map
macros to return their inputs.
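The guarding described here can be sketched as follows. The macro names (SK_FITS_IN_8_BITS, sk_isALPHA_LC, sk_toUPPER_LC) are hypothetical, since this message does not name the affected macros; the sketch just wraps the C library calls so that they are never reached with an out-of-range argument.

```c
#include <ctype.h>

/* True if the value can be passed safely to the <ctype.h> functions. */
#define SK_FITS_IN_8_BITS(c) ((unsigned long)(c) <= 0xFFul)

/* Boolean macro: anything outside the Latin1 range is simply false. */
#define sk_isALPHA_LC(c) \
    (SK_FITS_IN_8_BITS(c) ? (isalpha((unsigned char)(c)) != 0) : 0)

/* Map macro: anything outside the Latin1 range is returned unchanged. */
#define sk_toUPPER_LC(c) \
    (SK_FITS_IN_8_BITS(c) ? (unsigned long)toupper((unsigned char)(c)) \
                          : (unsigned long)(c))
```

Passing a value outside the unsigned char range (other than EOF) to isalpha() or toupper() is undefined behavior in C, which is the kind of compiler- and library-dependent result the guard avoids.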
|
|
|
|
|
|
|
|
|
| |
This macro expands to a function or macro call that does not exist, so
this macro itself can't be being used by anyone. It also doesn't fit
the paradigm of the other macros above it, being defined in terms of
uni instead of uvchr; nor does it really gain anything, since \s is a
posix space under locale. The \f also appears to be a typo; based on
the commit message, it should have been \v.
|
|
|
|
| |
This was missed in commit 075b9d7d9a6d4473b240a047655e507c8baa6db3.
|
|
|
|
|
|
|
| |
These two macros should have the same results for the same input code
points. Prior to this patch, the _uni() macro returned the official
Unicode ID_Start property, and the _utf8() macro returned Perl's
slightly restricted definition. Now both return Perl's.
|
|
|
|
|
|
| |
This function was added in 5.16, and has no callers in CPAN. It is
undocumented and marked as changeable. Its name has two underscores in
a row by mistake. This removes one of them.
|
|
|
|
|
|
|
|
|
|
|
| |
This refactors the isSPACE_uni, is_SPACE_utf8, isPSXSPC_uni,
and is_PSXSPC_utf8 macros in handy.h, so that no function call need be
done to handle above Latin1 input. These macros are quite small, and
unlikely to grow over time, as Unicode has mostly finished adding white
space equivalents to the Standard. The functions that implement these
in utf8.c are also changed to use the macros instead of generating a
swash. This should speed things up slightly, with less memory used over
time as the swash fills.
|
|
|
|
|
|
|
|
|
|
| |
This adds macros to regen/regcharclass.pl that are usable as part of the
is_XDIGIT_foo() macros in handy.h, so that no function call need be done
to handle above Latin1 input. These macros are quite small, and
unlikely to grow over time. The functions that implement these in
utf8.c are also changed to use the macros instead of generating a swash.
This should speed things up slightly, with less memory used over time as
the swash fills.
|
|
|
|
|
|
|
|
|
|
|
| |
This adds macros to regen/regcharclass.pl that are usable as part of the
is_BLANK_foo() macros in handy.h, so that no function call need be done
to handle above Latin1 input. These macros are quite small, and
unlikely to grow over time, as Unicode has mostly finished adding white
space equivalents to the Standard. The functions that implement these
in utf8.c are also changed to use the macros instead of generating a
swash. This should speed things up slightly, with less memory used over
time as the swash fills.
|
|
|
|
|
| |
These two macros match the same things as \v does in patterns. I'm
leaving them undocumented for now.
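For reference, a code-point version of the \v test is small enough to write out by hand. The function name sk_is_vertws_cp is hypothetical, and the list of code points reflects my understanding of what \v matches in Perl patterns; it is a sketch, not the generated macro.

```c
#include <stdbool.h>

/* Vertical whitespace: LF, VT, FF, CR, NEL, LINE SEPARATOR and
 * PARAGRAPH SEPARATOR (assumes an ASCII-based character set). */
static bool sk_is_vertws_cp(unsigned long cp)
{
    return (cp >= 0x0A && cp <= 0x0D)
        || cp == 0x85
        || cp == 0x2028
        || cp == 0x2029;
}
```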
|
|
|
|
|
|
|
| |
All controls will always be in the Latin1 range by Unicode's stability
policy. This means that we don't have to call is_utf8_cntrl() when the
input to the is_CNTRL_utf8() macro is above Latin1; we can just fail.
And that means that Perl_is_utf8_cntrl() can just use the macro.
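A minimal sketch of why no function call is needed above Latin1. The name sk_is_cntrl_utf8 is hypothetical, and the code assumes well-formed UTF-8 on an ASCII platform: only the lead bytes 0xC2 and 0xC3 can start a character in the upper Latin1 range, so any other multi-byte lead byte can fail immediately.

```c
#include <stdbool.h>

/* p points at the first byte of a well-formed UTF-8 character. */
static bool sk_is_cntrl_utf8(const unsigned char *p)
{
    if (p[0] < 0x80)                      /* ASCII: C0 controls and DEL */
        return p[0] < 0x20 || p[0] == 0x7F;

    if (p[0] == 0xC2 || p[0] == 0xC3) {   /* U+0080..U+00FF */
        unsigned cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        return cp >= 0x80 && cp <= 0x9F;  /* C1 controls */
    }

    return false;                         /* above Latin1: never a control */
}
```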
|
|
|
|
|
| |
This refactors these macros so that other macros automatically add aTHX_
if necessary.
|
|
|
|
|
|
|
| |
This will be used in future commits to allow \v and \V to be treated
consistently with other character classes. (Doing the same for \h isn't
necessary, as it matches identically to [:blank:] in the entire Unicode
range.)
|
| |
|
|
|
|
|
|
| |
This changes the names of some macros to begin with an underscore, and
changes comments to more clearly indicate things which aren't to be used
outside the Perl core.
|
|
|
|
|
|
|
| |
This refactors the macro builder macros to accept a class number instead
of a macro name. This is easier to understand than having to use
CAT2(), and it allows for a potential future commit to use these at
run-time, given a class number.
|
|
|
|
|
|
| |
These three outliers don't have to be. They can use the same
constructed form as the others surrounding them. One requires a
temporary #define which will be removed in a future commit.
|
|
|
|
|
|
|
| |
isWORDCHAR_uni() and isWORDCHAR_utf8() are defined for consistency with
the other isWORDCHAR...() macros.
The isALNUM_foo() versions are retained for backwards compatibility.
|