| Commit message | Author | Age | Files | Lines |
| |
The recently introduced macro isMNEMONIC_CNTRL has a look-up and several
tests in it, which occupy time and space. Since it was only used for
debugging, that did not matter much, but future commits will use it in
more mainline code. This commit changes it to be a single look-up,
using up one of the spare bits available for that purpose in
PL_charclass. There are enough available bits that we aren't likely to
run out, really ever. (We can always add a 2nd word of bits if
necessary.)
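A minimal sketch of the single look-up idea follows. The table name, the bit number, and the set of characters treated as mnemonic controls are all assumptions made for illustration; they are not copied from the Perl source.

```c
#include <stdint.h>

/* Hypothetical bit number; the value used in Perl's headers may differ. */
#define SKETCH_CC_MNEMONIC_CNTRL 5
#define SKETCH_MASK(n) ((uint32_t)1 << (n))
#define MC SKETCH_MASK(SKETCH_CC_MNEMONIC_CNTRL)

/* Stand-in for PL_charclass: one word of property bits per code point.
 * Only the entries this sketch needs are filled in; the rest are zero.
 * Which controls count as "mnemonic" is an assumption (here, those with
 * a C escape sequence). */
static const uint32_t sketch_charclass[256] = {
    ['\a'] = MC, ['\b'] = MC, ['\t'] = MC, ['\n'] = MC,
    ['\f'] = MC, ['\r'] = MC,
};

/* One table look-up and one mask, with no further tests. */
#define SKETCH_IS_MNEMONIC_CNTRL(c) \
    ((sketch_charclass[(uint8_t)(c)] & MC) != 0)
```

The cost of the macro is then independent of how many characters are in the class, at the price of one bit per table word.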
| |
This commit also causes escaped (by a backslash) "(", "[", and "{" to be
considered literally. In the previous 2 Perl versions, the escaping was
ignored, and a (default-on) deprecation warning was raised. Now that we
have warned for 2 release cycles, we can change the meaning of escaping
to actually do something.
Warning when a literal left brace is not escaped by a backslash will
allow us to eventually use this character in more contexts as being
meta, allowing us to extend the language. For example, the lower limit
of a quantifier could be omitted, and better error checking instituted,
or things like \w could be followed by a {...} indicating some special
word character, like \w{Greek} to restrict to just Greek word
characters.
We tried to do this in v5.16, and many CPAN modules changed to backslash
their left braces at that time. However we had to back out that change
before 5.16 shipped because it turned out that escaping a left brace in
some contexts didn't work, namely when the brace would normally be a
metacharacter (for example surrounding a quantifier), and the pattern
delimiters were { }. Instead we raised the useless backslash warning
mentioned above, which has now been there for the requisite 2 cycles.
This patch partially reverts 2 patches. The first,
e62d0b1335a7959680be5f7e56910067d6f33c1f, partially reverted
the deprecation of unescaped literal left brace. The other,
4d68ffa0f7f345bc1ae6751744518ba4bc3859bd, instituted the deprecation of
the useless backslash before these left characters.
Note that, as in the original attempt to deprecate, we don't raise a
warning if the left brace is the first character in the pattern. This
is because in that position it can't be a metacharacter, so we don't
require any disambiguation, and we found that if we did raise an error,
there were quite a few places where this occurred.
| |
There are a few characters in the Latin1 range that can be folded to by
above-Latin1 characters. Some of these are folded to as part of a
single character fold, like KELVIN SIGN folds to 'k'. More are folded
to as part of a multi-character fold. Until this commit, there wasn't a
quick way to distinguish between the two classes. A couple of places
only want the single-character ones. It is more efficient to look for
just those than to include the multi-char ones which end up not doing
anything. This uses a bit in l1_char_class_tab.h to indicate those
characters that are in the desired class.
| |
This causes the generated l1_char_class_tab.h to be valid on all
supported platforms.
| |
Since this program was written, the abbreviated names of the control
characters have become available from charnames::viacode(). We change
to use these instead of hard-coding them in.
At the same time, this shortens the names for some of the other
characters in cases where it is easy to read the short ones.
It also changes to use mnemonics instead of hard-coded ordinals, like
using ASCII instead of x < 128. This allows it to be run on an EBCDIC
platform.
| |
There are three pairs of characters that Perl recognizes as
metacharacters in regular expression patterns: {}, [], and (). These
can be used as well to delimit patterns, as in:
    m{foo}
    s(foo)(bar)
Since they are metacharacters, they have special meaning to regular
expression patterns, and it turns out that you can't turn off that
special meaning by the normal means of preceding them with a backslash,
if you use them, paired, within a pattern delimited by them. For
example, in
    m{foo\{1,3\}}
the backslashes do not change the behavior, and this matches "f", "o"
followed by one to three more occurrences of "o".
Usages like this, where they are interpreted as metacharacters, are
exceedingly rare; we think there are none, for example, in all of CPAN.
Hence, this deprecation should affect very little code. It does give
notice, however, that any such code needs to change, which will in turn
allow us to change the behavior in future Perl versions so that the
backslashes do have an effect, and without fear that we are silently
breaking any existing code.
=head1 Performance Enhancements
| |
/[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property
\p{Cased}. This commit introduces a pseudo-Posix class, internally named
'cased', to represent this. This class isn't specifiable by the user,
except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug
output will say ':cased:'.
The regex parsing either of :lower: or :upper: will change them into
:cased:, where already existing logic can handle this, just like any
other class.
This commit fixes the regression introduced in
3018b823898645e44b8c37c70ac5c6302b031381, and also the fact that these
have never worked under 'use locale'. The next commit will un-TODO the
tests for these things.
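The rewriting of :upper: and :lower: into the pseudo-class can be sketched as below. This is an ASCII-only illustration built on the C library's ctype functions in the default "C" locale; the enum and function names are hypothetical and do not come from the regex compiler.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical class identifiers for this sketch only. */
enum sketch_class { SKETCH_UPPER, SKETCH_LOWER, SKETCH_CASED };

/* Under case-insensitive matching, :upper: and :lower: both become
 * the pseudo-class 'cased'; otherwise the class is left alone. */
static enum sketch_class sketch_effective_class(enum sketch_class cls,
                                                bool fold)
{
    if (fold && (cls == SKETCH_UPPER || cls == SKETCH_LOWER))
        return SKETCH_CASED;
    return cls;
}

/* Match one character against the (possibly rewritten) class. */
static bool sketch_class_matches(enum sketch_class cls, bool fold,
                                 unsigned char c)
{
    switch (sketch_effective_class(cls, fold)) {
    case SKETCH_UPPER: return isupper(c) != 0;
    case SKETCH_LOWER: return islower(c) != 0;
    case SKETCH_CASED: return isupper(c) || islower(c);
    }
    return false;
}
```

Because the rewrite happens once, at parse time, the matching code needs no special handling for the case-insensitive flag.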
| |
Perl has had an undocumented macro isALNUMC() for a long time. I want
to document it, but the name is very obscure. Neither Yves nor I is
sure what it stands for. My best guess is "C's alnum". It corresponds to
/[[:alnum:]]/, and so its best name would be isALNUM(). But that is the
name long given to what matches \w. A new synonym, isWORDCHAR(), has
been in place for several releases for that, but the old isALNUM()
should remain for backwards compatibility.
I don't think that the name isALNUMC() should be published, as it is too
close to isALNUM(). I finally came to the conclusion that
isALPHANUMERIC() is the best name; it describes its purpose clearly; the
disadvantage is its long length. I doubt that it will get much use, but
we need something, I think, that we can publish to accomplish this
functionality.
This commit also converts core uses of isALNUMC to isALPHANUMERIC. (I
intended to do that separately, but made a mistake in rebasing, and
combined the two patches; and it seemed like not a big enough problem to
separate them out again.)
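The distinction between the two classes is small but real, which is why the similar names are a hazard. A rough ASCII-only sketch, using stand-in functions rather than the actual Perl macros:

```c
#include <ctype.h>
#include <stdbool.h>

/* Stand-ins named differently from the real macros on purpose.
 * Both are restricted to the ASCII range for simplicity. */

/* Corresponds to /[[:alnum:]]/: letters and digits only. */
static bool sketch_is_alphanumeric(unsigned char c)
{
    return c < 128 && isalnum(c);
}

/* Corresponds to \w: letters, digits, and the underscore. */
static bool sketch_is_wordchar(unsigned char c)
{
    return sketch_is_alphanumeric(c) || c == '_';
}
```

In this range the underscore is the only character on which the two disagree.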
| |
This will be used in future commits to allow \v and \V to be treated
consistently with other character classes. (Doing the same for \h isn't
necessary, as it matches identically to [:blank:] in the entire Unicode
range.)
| |
This commit uses the mktables defined table for whether or not a
character is a legitimate charname continuation. This will allow it to
be kept in sync with other code that needs the definition.
The only change this makes is to delete "colon" from being a legitimate
continuation character. A colon was only accepted because it was used
in the paradigm for names like "Greek: Alpha", and is not part of any
actual character name.
| |
This code is for just this property and was kludged in to be executed in
the general loop. It makes more sense for it to be in the subroutine
that handles the property that was just added in a prior commit.
It also changes the output slightly. The Latin1 sharp S isn't a
non-final fold, unlike what was said previously.
| |
This takes the existing mktables-generated table that lists all
characters that participate in any way in a fold, and creates a bit for
it in l1_char_class_tab.h.
| |
This starts with the existing table that mktables generates that lists
all the characters in Unicode that occur in multi-character folds, and
aren't in the final positions of any such fold.
It generates data structures with this information to make it quickly
available to code that wants to use it. Future commits will use these
tables.
| |
This array is a bit map containing the Posix and similar character
classes for the first 256 code points. Prior to this commit many
character classes were represented by two bits, one for characters that
are in it over the full Latin-1 range, and one for just the ASCII
characters that are in it. The number of bits in use was approaching
the 32-bit limit available without playing games.
This commit takes advantage of a recent commit that adds a bit to the
table for all the ASCII characters, and the fact that the ASCII
characters in a character class are a subset of the full Latin1
range. So a character is an ASCII-range character with the given
character class iff both the full-range character class bit and the
ASCII bit are set.
A new internal macro is created to generate code to determine if a
character is an ASCII range character with the given class. It's not
clear if the generated code is faster or slower than the full range
version.
The result is that nearly half the bits are freed up, as the ones for
the ASCII-range are now redundant.
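The two-bit test can be sketched as follows. The class numbers, macro names, and table are hypothetical, and the table is filled at run time here only to keep the example short; the character literals assume an ASCII platform.

```c
#include <stdint.h>

/* Hypothetical class numbers; the real ones may differ. */
enum { SK_CC_ALPHA = 0, SK_CC_ASCII = 3 };
#define SK_MASK(cc) ((uint32_t)1 << (cc))

static uint32_t sk_table[256];

/* Fill in just the two bits used below for all 256 code points. */
static void sk_init_table(void)
{
    for (int c = 0; c < 256; c++) {
        uint32_t bits = 0;
        if (c < 128)
            bits |= SK_MASK(SK_CC_ASCII);
        /* Alphabetic characters in the Latin-1 range. */
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || c == 0xAA || c == 0xB5 || c == 0xBA
            || (c >= 0xC0 && c != 0xD7 && c != 0xF7))
            bits |= SK_MASK(SK_CC_ALPHA);
        sk_table[c] = bits;
    }
}

/* Full-range test: a single bit. */
#define SK_IS_L1(c, cc) ((sk_table[(uint8_t)(c)] & SK_MASK(cc)) != 0)

/* ASCII-range test: the class bit AND the ASCII bit must both be set. */
#define SK_IS_A(c, cc)                                            \
    ((sk_table[(uint8_t)(c)] & (SK_MASK(cc) | SK_MASK(SK_CC_ASCII))) \
     == (SK_MASK(cc) | SK_MASK(SK_CC_ASCII)))
```

The ASCII-range test costs one comparison against a combined mask instead of a non-zero check, which is the trade-off whose speed the message above leaves open.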
| |
This changes the #defines to be just the shift number, while doing
the shifting in the macro that the number is passed to. This will prove
useful in future commits.
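For illustration, the two styles might look like this. All names and numbers here are made up for the sketch; only the shape of the change is taken from the description above.

```c
#include <stdint.h>

/* Old style: each #define is the finished mask. */
#define SK_OLD_CC_DIGIT ((uint32_t)1 << 1)
#define SK_OLD_TEST(word, mask) (((word) & (mask)) != 0)

/* New style: each #define is only the shift number ... */
#define SK_NEW_CC_DIGIT 1
#define SK_NEW_CC_ASCII 3

/* ... and the shifting is done by the macro the number is passed to. */
#define SK_NEW_MASK(n) ((uint32_t)1 << (n))
#define SK_NEW_TEST(word, n) (((word) & SK_NEW_MASK(n)) != 0)
```

Both styles test the same bit; the second keeps the class number available to any macro that wants it as a number rather than as a mask.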
| |
This does not replace the isASCII macro definition, as I think the
current one is more efficient than the one this bit would provide. But future
commits will rely on all the named character classes (e.g.,
/[[:ascii:]]/) having a bit, and this is the only one missing.
| |
If the version of Unicode being compiled doesn't have the modern
casefolding .txt file, get the values from Unicode::UCD. The same is
done for EBCDIC, where otherwise the file would have to be translated.
| |
The new definition is likely slightly faster, as it replaces an array
lookup with a mask.
Comments are also added, listing the other possible candidates for this
treatment, though the speed differential is unclear as they would also
add an extra test.
A U32 is used to store the information about the various properties for
a character. This frees up one bit of that for other future use.
| |
This commit is the minimal necessary to get \s to match the vertical
tab. It is being done early in the 5.17 series in order to see what
repercussions there might be from doing this.
It may well be that we decide that this change will require a 'use
feature' to activate. In any event there is significant documentation
of the behavior without the VT that this patch does not address at all.
Tom Christiansen asked Larry Wall why \s did not include VT, and
reported that Larry replied that he did not remember, but had no
objections to adding it.
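In the ASCII range the change amounts to one extra character, as in the sketch below. The function names are hypothetical, and the earlier set shown (space, tab, newline, form feed, carriage return) is stated from general knowledge of Perl's documented behavior rather than from this commit.

```c
#include <stdbool.h>

/* \s before this change, ASCII range only. */
static bool sketch_old_backslash_s(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* \s after this change: the same set plus the vertical tab. */
static bool sketch_new_backslash_s(unsigned char c)
{
    return sketch_old_backslash_s(c) || c == '\v';
}
```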
| |
This changes this header to include a bit for each character indicating
if it should be quoted by quotemeta under unicode_strings.
| |
This commit delivers the official Unicode character database files for
release 6.1, plus the final bits needed to cope with the changes in them
from release 6.0, including documentation.
| |
This reverts commit 88c8c9616516015e2fe0b502cdb92dc4efcc0c10.
It turns out that these multi-char fold targets are now needed.
In a future commit, I plan to compile in the dozen or so rules that
are needed to keep a Latin1-only regex from having to go out to the
utf8 tables, and so avoid the performance penalty; or calling code can
use the also forthcoming 'use re "/aa"'.
| |
These are not currently used, and slow things down, as regular
expressions that have them, such as /[Etl]/i, now have to go out and load
utf8 code. This remains the case, though, for bracketed character
classes that include [KkSs].
| |
Change it to read CaseFolding.txt from lib/unicore, instead of the file
installed with perl, so that it can run with an uninstalled perl.
Add "read only" editor blocks to l1_char_class_tab.h.
| |
Now l1_char_class_tab.h consists only of the output of
Porting/mk_PL_charclass.pl.
| |
This patch is the result of running mk_PL_charclass.pl.
| |
The output of the revised Porting/mk_charclass.pl is here incorporated
into this .h, with a #define for the new bit that signifies if a
character participates in a fold with a non-latin1 character.
| |
The generated table was wrong in the Latin1 range for characters with
the ALNUMC property.
|
This patch adds a table for looking up character classes. It is 256
words long, in l1_char_class_tab.h, with each word corresponding to the
ordinal of a Latin1 character, and each word contains a bit map of all
the properties that character matches. Each property has a bit or two.
Ones named _CC_property_A are true only if the character is also in the
ASCII character set. Ones named _CC_property_L1 do not have this
restriction. (L1 stands for Latin1.)
Also added is a script that generates the table. It is not anticipated
that this will need to be used often.
(This commit was changed from its original form by Steffen.)
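A small excerpt-style sketch of such a table is shown below, with one _A bit and one _L1 bit per property. The names, bit positions, and the four entries are illustrative assumptions and are not taken from the generated header; the character literals assume an ASCII platform.

```c
#include <stdint.h>

/* Hypothetical bits: one pair per property. */
#define SK_ALPHA_A   ((uint32_t)1 << 0)  /* alphabetic and ASCII */
#define SK_ALPHA_L1  ((uint32_t)1 << 1)  /* alphabetic, full Latin1 range */
#define SK_SPACE_A   ((uint32_t)1 << 2)  /* space and ASCII */
#define SK_SPACE_L1  ((uint32_t)1 << 3)  /* space, full Latin1 range */

/* 256 words indexed by ordinal; unlisted entries are zero in this
 * sketch, whereas a generated table would fill in every one. */
static const uint32_t sk_l1_tab[256] = {
    [' ']  = SK_SPACE_A | SK_SPACE_L1,
    ['a']  = SK_ALPHA_A | SK_ALPHA_L1,
    [0xA0] = SK_SPACE_L1,   /* NO-BREAK SPACE */
    [0xE9] = SK_ALPHA_L1,   /* LATIN SMALL LETTER E WITH ACUTE */
};

/* Test one property bit for one character. */
#define SK_HAS(c, bit) ((sk_l1_tab[(uint8_t)(c)] & (bit)) != 0)
```

An ASCII character that has a property sets both bits of the pair, while a non-ASCII Latin1 character sets only the _L1 bit.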
|