| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
Fix spelling on various files pertaining to core Perl.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This splits a bunch of the subcomponents of the regex engine into
smaller files.
regcomp_debug.c
regcomp_internal.h
regcomp_invlist.c
regcomp_study.c
regcomp_trie.c
Besides the changes to the build machinery needed to achieve the split,
the only real change is the addition of some new defines which can be
used in embed.fnc to control exports without having to enumerate /every/
regex engine file. For instance, all of the regcomp*.c files define
PERL_IN_REGCOMP_ANY, and this is used in embed.fnc to manage exports.
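The idea of one shared define can be sketched in plain C as follows.
This is only an illustration under stated assumptions: the helper
function and the REGCOMP_HELPER_VISIBLE macro are hypothetical, and the
real export control lives in embed.fnc, whose syntax is not shown here.

```c
#include <assert.h>

/* Hypothetical stand-in for the top of one regcomp*.c file: the file
 * announces that it is part of the regex compiler. */
#define PERL_IN_REGCOMP_ANY

/* Hypothetical stand-in for a shared header: a single test covers every
 * regcomp*.c file, so the list of files never has to be spelled out. */
#ifdef PERL_IN_REGCOMP_ANY
static int regcomp_only_helper(int depth)
{
    assert(depth >= 0);
    return depth + 1;
}
#  define REGCOMP_HELPER_VISIBLE 1
#else
#  define REGCOMP_HELPER_VISIBLE 0
#endif
```

A file that does not define PERL_IN_REGCOMP_ANY would simply never see
the helper's declaration.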
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode has lots of arrows of various shapes, sizes, and directions.
None of them were of consequence to the Bidirectional algorithm, so none
were specified as being mirrored pairs. This commit uses the
generalizations already in place from previous commits to examine arrow
symbols and choose which are mirrored pairs.
As previously, it rejects arrows with contrary directionality, and ones
without horizontal directionality.
|
|
|
|
| |
The characters with this name look good as mirrored delimiters.
|
|
|
|
| |
The characters with this name look good as mirrored delimiters.
|
|
|
|
| |
The characters with this name look good as mirrored delimiters.
|
|
|
|
| |
The characters with this name look good as mirrored delimiters.
|
|
|
|
| |
The characters with this name look good as mirrored delimiters.
|
|
|
|
|
| |
The characters that signify the beginning and ending of Western music
scores serve as good delimiters.
|
|
|
|
|
|
| |
The bidi-aware characters containing this word are visually suitable for
being mirrored delimiters. The 'index' refers to the index finger
in a hand pointing at the delimited string.
|
|
|
|
|
| |
The bidi-aware characters containing this word are visually suitable for
being mirrored delimiters.
|
|
|
|
|
| |
The bidi-aware characters containing this word are visually suitable for
being mirrored delimiters.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Another way Unicode indicates that a character has horizontal
directionality is by adding LEFT or RIGHT to the name of a base
character. Hence we get RIGHT SPEAKER vs just plain SPEAKER.
Presumably this comes about when they didn't consider directionality at
first, and then realized later it was needed.
This commit makes the script look for these kinds of character pairs.
Because the current Unicode version only has this characteristic for
Symbols, and symbols must be included explicitly, no change in what
gets paired ensues. But if you turn on the outputting of characters not
chosen, that list will now include things meeting this new criterion.
Fewer than a handful actually are like this.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Heretofore, the code looking for paired string delimiters has looked at
punctuation, and a few symbols that Unicode gives a mirror for. But
there are many more suitable-for-pairing characters in Unicode.
This commit generalizes things so as to handle the extra complexities of
the way symbols are named beyond the punctuation names. For example,
RIGHTWARDS is sometimes used; it turns out that it also is used in one
punctuation character, which was previously overlooked by this script.
The generalization introduced by this commit handles almost all current
Unicode symbols properly.
But some symbols are barely distinguishable from their mirrors, such as
a tilde and a reversed tilde. The scheme adopted here, then, makes the
default for a symbol pair to not be marked as paired delimiters. The
code explicitly has to specify that a given pair is to be included.
The next few commits are mostly for adding ones that I thought were
good.
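The default-off policy for symbols can be sketched as a small allow
list. This is a minimal illustration, not the script's actual code: the
enum, the function name, and the use of substring matching are all
assumptions, and the four stems are simply the names mentioned in the
follow-up commits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum category { CAT_PUNCTUATION, CAT_SYMBOL };

/* Hypothetical allow list: name stems of symbols explicitly opted in. */
static const char *const included_symbol_stems[] = {
    "ELEMENT OF", "SUBSET", "PRECEDES", "SMALLER THAN"
};

/* Return 1 if a mirrored pair whose first member has this name should be
 * emitted as paired delimiters.  Punctuation is accepted by default;
 * symbols are rejected unless a stem on the allow list matches. */
int pair_is_accepted(enum category cat, const char *name)
{
    size_t i;

    assert(name != NULL);
    if (cat == CAT_PUNCTUATION)
        return 1;
    for (i = 0; i < sizeof included_symbol_stems
                    / sizeof included_symbol_stems[0]; i++) {
        if (strstr(name, included_symbol_stems[i]) != NULL)
            return 1;
    }
    return 0;   /* e.g. a tilde and its barely distinguishable mirror */
}
```

Keeping the default at "reject" means a new Unicode release cannot
silently add confusing symbol pairs; someone has to opt each family in.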
|
|
|
|
|
| |
This commit adds 8 pairs of symbols that are variants on ELEMENT OF.
These make nice paired delimiters in the vein of < >.
|
|
|
|
|
| |
This commit adds 20 pairs of symbols that are variants on SUBSET.
These make nice paired delimiters in the vein of < >.
|
|
|
|
|
| |
This commit adds 15 pairs of symbols that are variants on PRECEDES.
These look a lot like <>, so it makes sense to make them paired
delimiters.
|
|
|
|
|
| |
This commit adds 2 pairs of symbols that are variants on SMALLER THAN.
These look a lot like <>, so it makes sense to make them paired
delimiters.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, only the punctuation characters that Unicode had classed as
being opening/closing were considered in looking for suitable paired
delimiters.
This commit looks at all punctuation characters. There are actually
only 7 new pairs found.
This gives us ꧁ ꧂ as string delimiters, if your font allows,
which are Javanese and used to surround an honorific title, according to
Wikipedia.
|
|
|
|
|
|
|
| |
Perl considers '< >' to be delimiters for strings; this commit adds
most of the Unicode variants of these to also be string delimiters. The
ones that are combinations of both < and >, aren't included, as that
would be visually confusing.
|
|
|
|
|
|
|
|
| |
Besides LEFT/RIGHT, horizontal directionality can be specified by
Unicode in names by the presence or absence of REVERSED.
Enhancing the algorithm to take this into account adds 2 pairs of
mirrored delimiters that were previously overlooked.
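The REVERSED test can be sketched as a comparison of two character
names. This is a minimal illustration assuming the word appears once,
followed by a space; the function names are hypothetical, and the
example names in the comments show only the shape of such names, not
which pairs this commit actually added.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if deleting the word "REVERSED " from rev yields base. */
static int is_reversed_form_of(const char *rev, const char *base)
{
    static const char word[] = "REVERSED ";
    const size_t wlen = sizeof word - 1;
    const char *hit;
    size_t prefix;

    assert(rev != NULL && base != NULL);
    hit = strstr(rev, word);
    if (hit == NULL)
        return 0;
    if (hit != rev && hit[-1] != ' ')   /* must start a whole word */
        return 0;
    prefix = (size_t)(hit - rev);
    /* The length check runs first, so base + prefix stays in bounds. */
    return strlen(base) == strlen(rev) - wlen
        && strncmp(rev, base, prefix) == 0
        && strcmp(hit + wlen, base + prefix) == 0;
}

/* Two names are a candidate mirrored pair if they differ only by the
 * presence or absence of REVERSED, e.g. "REVERSED SEMICOLON" versus
 * "SEMICOLON". */
int differ_only_by_reversed(const char *a, const char *b)
{
    return is_reversed_form_of(a, b) || is_reversed_form_of(b, a);
}
```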
|
|
|
|
|
|
|
|
|
|
|
|
| |
This script now examines all punctuation characters to see if there is a
mirrored character for it, suitable for use as a Perl string delimiter.
Some don't qualify, and some do qualify but the script doesn't catch
them.
This commit adds the ability to output which characters it doesn't think
qualify, and why. This enables a maintainer to easily check and know
what its deficiencies are, or that there is a good reason that a
particular character gets rejected.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, only characters that Unicode included in its bidirectional
algorithm have been eligible to be found by this program to be mirrored
string delimiters.
This commit adds 5 quotation marker character pairs that
are omitted from the bidirectional algorithm, as most quotes are,
because, as the Standard says, their "directionality and pairing status
is less predictable than paired brackets."
But we're not particularly interested in those semantics; most string
delimiters will be selected only for their visual appearance.
Because they aren't in the bidi algorithm, there is no property that
maps one member of a pair to its mate. However, two characters whose
names pair only by LEFT vs RIGHT are almost certainly a mirrored pair.
This doesn't catch all possibilities; future commits will expand the
ones caught.
The commit also refactors things so as to make it easier for future
commits to look at even more delimiter possibilities.
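The LEFT vs RIGHT heuristic can be sketched as a function that derives
the expected name of the mate; a pair exists if some character actually
carries that name. This is an illustrative sketch only: mirror_name is a
hypothetical name, and the real script applies further rules (such as
rejecting names with contrary directions) that are not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy name into out, replacing each whole word LEFT with RIGHT and each
 * RIGHT with LEFT.  Words are separated by spaces or hyphens.  Returns 1
 * if at least one word was swapped, 0 if none was (or if out is too
 * small, in which case out is set to the empty string). */
int mirror_name(const char *name, char *out, size_t outlen)
{
    size_t used = 0;
    int swapped = 0;
    const char *p = name;

    assert(name != NULL && out != NULL && outlen > 0);
    while (*p) {
        size_t wlen = strcspn(p, " -");   /* length of the next word */
        const char *word = p;
        size_t outw = wlen;

        if (wlen == 4 && memcmp(p, "LEFT", 4) == 0) {
            word = "RIGHT"; outw = 5; swapped = 1;
        } else if (wlen == 5 && memcmp(p, "RIGHT", 5) == 0) {
            word = "LEFT"; outw = 4; swapped = 1;
        }
        /* Room is needed for the word, one separator, and the NUL. */
        if (used + outw + 2 > outlen) {
            out[0] = '\0';
            return 0;
        }
        memcpy(out + used, word, outw);
        used += outw;
        p += wlen;
        if (*p)
            out[used++] = *p++;           /* keep the separator as is */
    }
    out[used] = '\0';
    return swapped;
}
```

For a name such as LEFT DOUBLE QUOTATION MARK this yields RIGHT DOUBLE
QUOTATION MARK; a name with no directional word is reported as having no
candidate mate.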
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode says certain opening punctuation characters may be used as
closing ones in some languages; and their mirror is instead the opening
one.
This commit changes things to allow either one of each such set to be
the opening one.
It also deprecates using any of the new mirrored delimiters as an
unmirrored delimiter outside the feature, and using the normal closing
delimiter as an unpaired opening one while in the feature. This gives
us the freedom to make some or all of the new paired delimiters be
reversible.
|
|
|
|
|
|
| |
This adds the capability to temporarily change a scalar to true to cause
this to print on stderr a list of the paired string delimiters, suitable
for pasting into a pod.
|
|
|
|
|
|
|
|
|
| |
This commit causes several C strings to be generated containing bytes
that match paired string delimiters beyond the four that have
traditionally been used in Perl. This will allow a future commit to
accept more matching delimiters around strings than those four.
The code explains how the added delimiters are chosen.
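How such generated data might be consumed can be sketched with two
parallel tables of UTF-8 byte sequences. This is an assumption-laden
illustration: the real generated strings, their layout, and the lookup
code in the tokenizer are not shown in this log, and closing_delimiter is
a hypothetical name. The sample holds the four traditional pairs plus the
Javanese pair ꧁ ꧂ (U+A9C1, U+A9C2) spelled out as bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parallel tables: openers[i] is closed by closers[i]. */
static const char *const openers[] = {
    "(", "[", "{", "<", "\xEA\xA7\x81"
};
static const char *const closers[] = {
    ")", "]", "}", ">", "\xEA\xA7\x82"
};

/* If s begins with a known opening delimiter, return the UTF-8 bytes of
 * its closing mate; otherwise return NULL, meaning the delimiter is not
 * a paired one. */
const char *closing_delimiter(const char *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < sizeof openers / sizeof openers[0]; i++) {
        size_t len = strlen(openers[i]);
        if (strncmp(s, openers[i], len) == 0)
            return closers[i];
    }
    return NULL;
}
```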
|
|
|
|
|
| |
This is in preparation for it to be used in multiple places in a future
commit.
|
|
|
|
| |
Align the output of this bit vertically with surrounding output.
|
|
|
|
|
|
| |
The names were intended to force people to not use them outside their
intended scopes. But by restricting those scopes in the first place, we
don't need such unwieldy names.
|
|
|
|
|
|
|
|
|
|
|
| |
This feature allows documentation destined for perlapi or perlintern to
be split into sections of related functions, no matter where the
documentation source is. Prior to this commit the line had to contain
the exact text of the title of the section. Now it can be a $variable
name that autodoc.pl expands to the title. It still has to be an exact
match for the variable in autodoc, but now, the expanded text can be
changed in autodoc alone, without other files needing to be updated at
the same time.
|
|
|
|
| |
which will be needed in a future commit
|
|
|
|
|
| |
apidoc_section is slightly favored over head1, as it is known only to
autodoc, and can't be confused with real pod.
|
|
|
|
|
|
|
|
|
|
| |
This makes various fixes to the text that is used to generate the
documentation. The dominant change is to add the 'n' flag to indicate
that the macro takes no arguments. A couple should have been marked
with a D (for deprecated) flag, and a couple were missing parameters,
and a couple were missing return values.
These were spotted by using Devel::PPPort on them.
|
|
|
|
| |
This will be needed in a future commit
|
|
|
|
|
|
|
|
|
|
| |
Like the previous commit, this code is adding the UTF-8 for a Greek
character to a string. It previously used Copy, but this character is
representable as two bytes in both ASCII and EBCDIC UTF-8, the only
character sets that Perl will ever support, so we can use the
specialized code that is used most everywhere else for two byte UTF-8
characters, avoiding the function overhead, and having to treat this
character as particularly special.
|
|
|
|
|
|
|
|
|
| |
This code is adding the UTF-8 for a Greek character to a string. It
previously used Copy, but this character is representable as two bytes
in both ASCII and EBCDIC UTF-8, the only character sets that Perl will
ever support, so we can use the specialized code that is used most
everywhere else for two byte UTF-8 characters, avoiding the function
overhead, and having to treat this character as particularly special.
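For illustration, appending a two-byte UTF-8 character without a copy
call looks roughly like the following. This sketch covers ASCII
platforms only (the EBCDIC variant uses a different byte layout), the
function name is hypothetical, and U+03C3 in the usage note is merely a
convenient Greek example, not necessarily the character this commit
deals with.

```c
#include <assert.h>
#include <stdint.h>

/* Append the two-byte UTF-8 encoding of cp at d and return the position
 * just past it.  Valid only for code points U+0080 through U+07FF, which
 * is exactly the range that needs two bytes on ASCII platforms. */
unsigned char *append_two_byte_utf8(unsigned char *d, uint32_t cp)
{
    assert(cp >= 0x80 && cp <= 0x7FF);
    *d++ = (unsigned char)(0xC0 | (cp >> 6));    /* lead: 110xxxxx */
    *d++ = (unsigned char)(0x80 | (cp & 0x3F));  /* continuation: 10xxxxxx */
    return d;
}
```

Calling it with 0x3C3 (GREEK SMALL LETTER SIGMA) stores the bytes 0xCF
0x83, with no function-call overhead beyond the two stores once inlined.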
|
|
|
|
|
| |
We need the length of the UTF-8 for this code point elsewhere, and it
is different between ASCII and EBCDIC.
|
|
|
|
|
|
|
|
|
|
| |
We changed to use symbols not likely to be used by non-Perl code that
could conflict, and which have trailing underbars, so they don't look
like a regular Perl #define.
See https://rt.perl.org/Ticket/Display.html?id=131110
There are many more header files which are not guarded.
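The guard style described here can be sketched as follows. The header
name, the guard symbol PERL_EXAMPLE_H_, and the struct are hypothetical;
only the shape (a Perl-specific prefix and a trailing underbar) follows
the description above.

```c
#include <assert.h>

/* Contents of a hypothetical header, perl_example.h. */
#ifndef PERL_EXAMPLE_H_   /* Guard against nested #inclusion */
#define PERL_EXAMPLE_H_

typedef struct {
    int major;
    int minor;
} example_version;

static int example_version_cmp(example_version a, example_version b)
{
    assert(a.major >= 0 && b.major >= 0);
    if (a.major != b.major)
        return a.major < b.major ? -1 : 1;
    if (a.minor != b.minor)
        return a.minor < b.minor ? -1 : 1;
    return 0;
}

#endif /* PERL_EXAMPLE_H_ */

/* Simulate a second #include of the same header: the guard symbol is
 * already defined, so the body is skipped and nothing is redefined. */
#ifndef PERL_EXAMPLE_H_
#  error "guard failed: header body would be seen twice"
#endif
```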
|
|
|
|
|
| |
require calls now require ./ to be prepended to the file since . is no
longer guaranteed to be in @INC.
|
|
|
|
|
| |
This makes it easy for module authors to write XS code that can use
these characters, and be automatically portable to EBCDIC systems.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As discussed in the previous commit, most code points in Unicode
don't change if upper-, or lower-cased, etc. In fact as of Unicode
v8.0, 93% of the available code points are above the highest one that
does change.
This commit skips trying to case these 93%. A regen/ script keeps track
of the max changing one in the current Unicode release, and skips casing
for the higher ones. Thus currently, casing emoji will be skipped.
Together with the previous commits that dealt with casing, the potential
for huge memory requirements for the swash hashes for casing is
severely limited.
If the following command is run on a perl compiled with -O2 and no
DEBUGGING:
blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after
and the file 'plane1_case_perf' contains
    [
        'string::casing::emoji' => {
            desc  => 'yes swash vs no swash',
            setup => 'my $a = "\x{1F570}"', # MANTELPIECE CLOCK
            code  => 'uc($a)'
        },
    ];
the following results are obtained:
The numbers represent raw counts per loop iteration.

string::casing::emoji
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              981.0    306.0
    Dr              228.0     94.0
    Dw              100.0     45.0
  COND              137.0     49.0
   IND                7.0      4.0
COND_m                5.5      0.0
 IND_m                4.0      2.0
 Ir_m1                0.1     -0.1
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0
 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0
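The shortcut this commit introduces can be sketched as a single
comparison placed in front of the expensive lookup. This is a toy model
under stated assumptions: table_upper stands in for the swash lookup and
only knows ASCII letters, the cutoff is passed as a parameter rather
than taken from the generated define, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Counts how often the (pretend) expensive lookup is reached. */
static unsigned long table_lookups;

/* Stand-in for the table-driven case mapping. */
static uint32_t table_upper(uint32_t cp)
{
    table_lookups++;
    return (cp >= 'a' && cp <= 'z') ? cp - ('a' - 'A') : cp;
}

/* Uppercase cp, skipping the lookup entirely when cp lies above the
 * highest code point that changes under any case operation. */
uint32_t upper_with_cutoff(uint32_t cp, uint32_t highest_changing_cp)
{
    assert(cp <= 0x10FFFF);
    if (cp > highest_changing_cp)
        return cp;              /* nothing up here ever changes case */
    return table_upper(cp);
}
```

With such a cutoff in place, a code point like U+1F570 returns
immediately and never touches the table.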
|
|
|
|
| |
The previous commit removed all uses of this non-public #define.
|
|
|
|
| |
These will be used in the next commit
|
|
|
|
|
| |
Future commits will want to take different actions depending on which
Unicode version is being used.
|
|
|
|
|
| |
LATIN CAPITAL LETTER SHARP S is not available in all Unicode releases;
simply skip generating things when it isn't there.
|
|
|
|
| |
This input flag was not being properly handled.
|
|
|
|
|
| |
This moves the description of what <DATA> lines should look like to
the __DATA__ line.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A synthetic start class (SSC) is generated by the regular expression
pattern compiler to give a consolidation of all the possible things that
can match at the beginning of where a pattern can possibly match.
For example
qr/a?bfoo/;
requires the match to begin with either an 'a' or a 'b'. There are no
other possibilities. We can set things up to quickly scan for either of
these in the target string, and only when one of these is found do we
need to look for 'foo'.
There is an overhead associated with using SSCs. If the number of
possibilities that the SSC excludes is relatively small, it can be
counter-productive to use them.
This patch creates a crude sieve to decide whether to use an SSC or not.
If the SSC doesn't exclude at least half the "likely" possibilities, it
is discarded. This patch is a starting point, and can be refined if
necessary as we gain experience.
See thread beginning with
http://nntp.perl.org/group/perl.perl5.porters/212644
In many patterns, no SSC is generated; and with the advent of tries,
SSCs have become less important, so whatever we do is not terribly
critical.
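The sieve can be sketched as a count over a bitmap of permitted first
bytes. This is a minimal illustration, not the code in regcomp.c: it
assumes that the "likely" possibilities are the printable ASCII
characters 0x20 through 0x7E, and the bitmap layout and function names
are hypothetical.

```c
#include <assert.h>

#define LIKELY_FIRST 0x20   /* assumed: space */
#define LIKELY_LAST  0x7E   /* assumed: highest ASCII printable */

/* One bit per byte value; a set bit means the SSC allows a match to
 * start with that byte. */
void bitmap_set(unsigned char bitmap[32], unsigned c)
{
    assert(c < 256);
    bitmap[c >> 3] |= (unsigned char)(1u << (c & 7));
}

static int bitmap_test(const unsigned char bitmap[32], unsigned c)
{
    assert(c < 256);
    return (bitmap[c >> 3] >> (c & 7)) & 1;
}

/* Keep the SSC only if it rules out at least half of the likely
 * starting characters; otherwise scanning for it costs more than it
 * saves. */
int ssc_is_worth_keeping(const unsigned char bitmap[32])
{
    unsigned likely = 0, excluded = 0, c;

    for (c = LIKELY_FIRST; c <= LIKELY_LAST; c++) {
        likely++;
        if (!bitmap_test(bitmap, c))
            excluded++;
    }
    return 2 * excluded >= likely;
}
```

For qr/a?bfoo/ the bitmap holds only 'a' and 'b', so nearly every likely
character is excluded and the SSC is kept; a class that admits almost
anything would be discarded.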
|
|
|
|
|
|
|
| |
This creates a #define that gives the highest code point that is an
ASCII printable. On ASCII-ish platforms, this is 0x7E, but on EBCDIC
platforms it varies, and can be as high as 0xFF. This is in preparation
for needing this value in a future commit in regcomp.c
|
|
|
|
| |
These will be used in future commits
|