| Commit message (Collapse) | Author | Age | Files | Lines |
| |
There are three pairs of characters that Perl recognizes as
metacharacters in regular expression patterns: {}, [], and (). These
can be used as well to delimit patterns, as in:
    m{foo}
    s(foo)(bar)
Since they are metacharacters, they have special meaning to regular
expression patterns, and it turns out that you can't turn off that
special meaning by the normal means of preceding them with a backslash,
if you use them, paired, within a pattern delimited by them. For
example, in
    m{foo\{1,3\}}
the backslashes do not change the behavior, and this matches "f", "o"
followed by one to three more occurrences of "o".
Usages like this, where they are interpreted as metacharacters, are
exceedingly rare; we think there are none, for example, in all of CPAN.
Hence, this deprecation should affect very little code. It does give
notice, however, that any such code needs to change, which will in turn
allow us to change the behavior in future Perl versions so that the
backslashes do have an effect, and without fear that we are silently
breaking any existing code.
=head1 Performance Enhancements
| |
/[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property
\p{Cased}. This commit introduces a pseudo-Posix class, internally named
'cased', to represent this. This class isn't specifiable by the user,
except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug
output will say ':cased:'.
The regex parsing either of :lower: or :upper: will change them into
:cased:, where already existing logic can handle this, just like any
other class.
This commit fixes the regression introduced in
3018b823898645e44b8c37c70ac5c6302b031381, and also the fact that these
have never worked under 'use locale'. The next commit will un-TODO the
tests for these things.
| |
Perl has had an undocumented macro isALNUMC() for a long time. I want
to document it, but the name is very obscure. Neither Yves nor I are
sure what it is. My best guess is "C's alnum". It corresponds to
/[[:alnum:]]/, and so its best name would be isALNUM(). But that is the
name long given to what matches \w. A new synonym, isWORDCHAR(), has
been in place for several releases for that, but the old isALNUM()
should remain for backwards compatibility.
I don't think that the name isALNUMC() should be published, as it is too
close to isALNUM(). I finally came to the conclusion that
isALPHANUMERIC() is the best name; it describes its purpose clearly; the
disadvantage is its long length. I doubt that it will get much use, but
we need something, I think, that we can publish to accomplish this
functionality.
This commit also converts core uses of isALNUMC to isALPHANUMERIC. (I
intended to do that separately, but made a mistake in rebasing and
combined the two patches, and it did not seem like a big enough problem
to separate them out again.)
| |
This will be used in future commits to allow \v and \V to be treated
consistently with other character classes. (Doing the same for \h isn't
necessary, as it matches identically to [:blank:] in the entire Unicode
range.)
| |
This commit uses the mktables defined table for whether or not a
character is a legitimate charname continuation. This will allow it to
be kept in sync with other code that needs the definition.
The only change this makes is to delete "colon" from being a legitimate
continuation character. A colon was only accepted because it was used
in paradigms like "Greek: Alpha", and is not part of any actual
character name.
| |
This code is for just this property and was kludged in to be executed in
the general loop. It makes more sense for it to be in the subroutine
that handles the property that was just added in a prior commit.
It also changes the output slightly. The Latin1 sharp S isn't a
non-final fold, unlike what was said previously.
| |
This takes the existing mktables-generated table that lists all
characters that participate in any way in a fold, and creates a bit for
it in l1_char_class_tab.h.
| |
This starts with the existing table that mktables generates that lists
all the characters in Unicode that occur in multi-character folds, and
aren't in the final positions of any such fold.
It generates data structures with this information to make it quickly
available to code that wants to use it. Future commits will use these
tables.
| |
This array is a bit map containing the Posix and similar character
classes for the first 256 code points. Prior to this commit many
character classes were represented by two bits, one for characters that
are in it over the full Latin-1 range, and one for just the ASCII
characters that are in it. The number of bits in use was approaching
the 32-bit limit available without playing games.
This commit takes advantage of a recent commit that adds a bit to the
table for all the ASCII characters, and the fact that the ASCII
characters in a character class are a subset of that class's members
over the full Latin1 range. So a character is an ASCII-range character
with the given character class iff both the full-range character class
bit and the ASCII bit are set.
A new internal macro is created to generate code to determine if a
character is an ASCII range character with the given class. It's not
clear if the generated code is faster or slower than the full range
version.
The result is that nearly half the bits are freed up, as the ones for
the ASCII-range are now redundant.
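The two-bit test described above can be sketched in C roughly as follows. This is only an illustration: the names EX_ALPHA, EX_ASCII, ex_entry and the two predicates are hypothetical and are not the identifiers used in the Perl source, and the entry is computed on the fly here rather than read from the generated table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define EX_ALPHA (UINT32_C(1) << 0)  /* alphabetic over the full Latin1 range */
#define EX_ASCII (UINT32_C(1) << 1)  /* code point is in the ASCII range */

/* Stand-in for one table entry; a real table would be precomputed. */
static uint32_t ex_entry(unsigned cp)
{
    uint32_t bits = 0;
    int ascii_alpha, upper_alpha;

    assert(cp < 256);
    ascii_alpha = (cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A);
    upper_alpha = cp == 0xAA || cp == 0xB5 || cp == 0xBA
               || (cp >= 0xC0 && cp != 0xD7 && cp != 0xF7);
    if (ascii_alpha || upper_alpha)
        bits |= EX_ALPHA;
    if (cp < 0x80)
        bits |= EX_ASCII;
    return bits;
}

/* Full-range test: only the class bit is consulted. */
static int ex_is_alpha_l1(unsigned cp)
{
    return (ex_entry(cp) & EX_ALPHA) != 0;
}

/* ASCII-range test: both the class bit and the ASCII bit must be set. */
static int ex_is_alpha_ascii(unsigned cp)
{
    const uint32_t both = EX_ALPHA | EX_ASCII;
    return (ex_entry(cp) & both) == both;
}
```

With this layout no separate "alphabetic and ASCII" bit is needed; the cost is that the ASCII-range test compares against a two-bit mask instead of testing a single bit.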
| |
This changes the #defines to be just the shift number, while doing
the shifting in the macro that the number is passed to. This will prove
useful in future commits.
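As a rough sketch of the idea (all SH_* and sh_* names are hypothetical, not the ones in the source), the class #defines hold only a shift count, and the macros that receive them do the shifting:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical class numbers: plain shift counts, not masks. */
#define SH_CC_ALPHA 0
#define SH_CC_DIGIT 1
#define SH_CC_ASCII 2

/* The shifting happens in the macros the numbers are passed to. */
#define SH_BIT(n)        (UINT32_C(1) << (n))
#define SH_TEST(word, n) (((word) & SH_BIT(n)) != 0)
#define SH_TEST_BOTH(word, n, m) \
    (((word) & (SH_BIT(n) | SH_BIT(m))) == (SH_BIT(n) | SH_BIT(m)))

/* Helper that sets the bit for a class number in a table word. */
static uint32_t sh_set(uint32_t word, unsigned n)
{
    assert(n < 32);
    return word | SH_BIT(n);
}
```

Keeping the bare numbers around means the same #define can be combined with others in a single mask, or used for purposes other than masking, without first undoing a shift.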
| |
This does not replace the isASCII macro definition, as I think the
current one is more efficient than what this bit would provide. But future
commits will rely on all the named character classes (e.g.,
/[[:ascii:]]/) having a bit, and this is the only one missing.
| |
If the version of Unicode being compiled doesn't have the modern
casefolding .txt file, get the values from Unicode::UCD. Do the same on
EBCDIC platforms, where the file would otherwise have to be translated.
| |
The new definition is likely slightly faster, as it replaces an array
lookup with a mask.
Comments are also added, listing the other possible candidates for this
treatment, though the speed differential is unclear as they would also
add an extra test.
A U32 is used to store the information about the various properties for
a character. This frees up one bit of that for other future use.
| |
This commit is the minimal necessary to get \s to match the vertical
tab. It is being done early in the 5.17 series in order to see what
repercussions there might be from doing this.
It may well be that we decide that this change will require a 'use
feature' to activate. In any event there is significant documentation
of the behavior without the VT that this patch does not address at all.
Tom Christiansen asked Larry Wall why \s did not include VT, and
reported that Larry replied that he did not remember, but had no
objections to adding it.
| |
This changes this header to include a bit for each character indicating
if it should be quoted by quotemeta under unicode_strings.
| |
This commit delivers the official Unicode character database files for
release 6.1, plus the final bits needed to cope with the changes in them
from release 6.0, including documentation.
| |
This reverts commit 88c8c9616516015e2fe0b502cdb92dc4efcc0c10.
It turns out that these multi-char fold targets are now needed.
In a future commit, I plan to compile in the dozen or so rules needed
to keep a Latin1-only regex from having to go out to the utf8 tables,
thereby avoiding the performance penalty; alternatively, calling code
can use the also forthcoming 'use re "/aa"'.
| |
These are not currently used, and slow things down, as regular
expressions that have them, such as /[Etl]/i, now have to go out and load
utf8 code. This remains the case, though, for bracketed character
classes that include [KkSs].
| |
Change it to read CaseFolding.txt from lib/unicore, instead of the file
installed with perl, so that it can run with an uninstalled perl.
Add "read only" editor blocks to l1_char_class_tab.h.
| |
Now the contents of l1_char_class_tab.h are only the output of
Porting/mk_PL_charclass.pl.
| |
This patch is the result of running mk_PL_charclass.pl.
| |
The output of the revised Porting/mk_charclass.pl is here incorporated
into this .h, with a #define for the new bit that signifies if a
character participates in a fold with a non-latin1 character.
| |
The generated table was wrong in the Latin1 range for characters with
the ALNUMC property.
|
This patch adds a table for looking up character classes. It is 256
words long, in l1_char_class_tab.h, with each word corresponding to the
ordinal of a Latin1 character, and each word contains a bit map of all
the properties that character matches. Each property has a bit or two.
Ones named _CC_property_A are true only if the character is also in the
ASCII character set. Ones named _CC_property_L1 do not have this
restriction. (L1 stands for Latin1.)
Also added is a script that generates the table. It is not anticipated
that this will need to be used often.
(This commit was changed from its original form by Steffen.)
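A minimal sketch of such a table and its lookup, in C, might look like the following. The DEMO_* and demo_* names are hypothetical, and the table is filled in at run time here purely to keep the example short; the patch instead ships a generated table in l1_char_class_tab.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical property bits: "_A" is ASCII-only, "_L1" covers Latin1. */
#define DEMO_CC_DIGIT_A  (UINT32_C(1) << 0)
#define DEMO_CC_ALPHA_A  (UINT32_C(1) << 1)
#define DEMO_CC_ALPHA_L1 (UINT32_C(1) << 2)

/* One word per Latin1 code point; each word is a bit map of properties. */
static uint32_t demo_class_tab[256];

/* Alphabetic code points in the Latin1 range (0x00..0xFF). */
static int demo_is_latin1_alpha(unsigned cp)
{
    if ((cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A))
        return 1;
    if (cp == 0xAA || cp == 0xB5 || cp == 0xBA)
        return 1;
    return cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7;
}

/* Fill the table; the real one is written out by a generator script. */
static void demo_init_tab(void)
{
    unsigned cp;
    for (cp = 0; cp < 256; cp++) {
        uint32_t bits = 0;
        if (cp >= 0x30 && cp <= 0x39)
            bits |= DEMO_CC_DIGIT_A;
        if (demo_is_latin1_alpha(cp)) {
            bits |= DEMO_CC_ALPHA_L1;
            if (cp < 0x80)
                bits |= DEMO_CC_ALPHA_A;
        }
        demo_class_tab[cp] = bits;
    }
}

/* Index by ordinal, then mask with the property bit. */
static int demo_has_class(unsigned cp, uint32_t class_bit)
{
    assert(cp < 256);
    return (demo_class_tab[cp] & class_bit) != 0;
}
```

After demo_init_tab() has run, demo_has_class(0xE9, DEMO_CC_ALPHA_L1) is true while demo_has_class(0xE9, DEMO_CC_ALPHA_A) is false, which is the distinction between the two kinds of bit.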
|