| Commit message | Author | Age | Files | Lines |
| |
In particular, remove all instances of 'assert(0);'.
|
| |
This is a more accurately named synonym for is_ascii_string(), which is
retained. The old name is misleading to someone programming for
non-ASCII platforms.
|
| |
The previous commit fixed a typo caused by it being hard to see the
differences in a long ALL_CAPS name. This uses #defines so the long
name is typed only once, and compile-time variables so the expression
for the strings' length is specified only once.
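A minimal sketch of the idiom, with hypothetical names standing in for
the real ALL_CAPS one:

    #include <stddef.h>

    /* Type the long name exactly once */
    #define LONG_DIAGNOSTIC_TEXT "Malformed UTF-8 character (unexpected end)"
    static const char long_diagnostic[] = LONG_DIAGNOSTIC_TEXT;
    /* sizeof counts the trailing NUL; this length expression now lives
     * in exactly one place */
    static const size_t long_diagnostic_len = sizeof(LONG_DIAGNOSTIC_TEXT) - 1;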
|
| |
This was a typo due to the long name. A future commit will make it
cleaner. The sizeof() of the wrong name evaluates to the right number
on ASCII platforms, but not on EBCDIC.
|
| |
This is explained in the added perldiag entry.
|
| |
On non-EBCDIC platforms, currently any UV is encodable as UTF-8. (This
would change if there were 128-bit words.) Thus, much code assumes that
nothing can go wrong when converting to UTF-8, and hence does no error
checking.
However, UTF-EBCDIC is only capable of representing code points below
2**32, so if there are 64-bit words, this function can fail.
Prior to this patch, there was no real overflow check, and garbage was
returned by this function if it was called with too large a number.
While not ideal, the easiest thing to do is to just die for such a
number, as we do for division by 0. This involves changing only code
within this function, and not its many callers.
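A hedged sketch of the sort of guard this implies; the actual test and
message in utf8.c may differ:

    /* In the code-point-to-UTF-8 routine, on an EBCDIC build with
     * 64-bit UVs: UTF-EBCDIC cannot represent code points of 2**32 or
     * above, so die rather than return garbage. */
    #if defined(EBCDIC) && UVSIZE >= 8
        if (uv > 0xFFFFFFFF)
            Perl_croak(aTHX_ "Can't represent code point 0x%" UVxf
                             " on this platform", uv);
    #endif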
|
| |
This function was called with an empty string "" because the string was
not actually needed in the function, except to better identify the
source when there is an error. So change the call to pass the actual
source.
|
| |
The comment really belongs inside it, as it refers to those two
lines of code.
|
| |
The call to save_re_context was removed by the previous commit. The
commit before that stopped save_re_context from doing anything.
Commit db2c6cb33 stopped the errsv_save line from triggering
get-magic.
So this comment, added in dc0c6abb4, no longer applies.
|
| |
It is an empty function.
|
| |
Set PL_curpm to NULL before we do any swash initialization
in _core_swash_init(). This "hides" the current regop from the
swash code, with the intent of preventing weird reentrancy bugs
when the swashes are initialized.
Long term you could argue that we should just not use the regex
engine to initialize a swash, and then this would be unnecessary.
Thanks to FC for the suggestion!
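Schematically, the pattern looks something like this (an illustration,
not the literal diff):

    /* In _core_swash_init(): hide the regop currently being executed,
     * so that re-entering the regex engine while loading swash data
     * cannot see or modify it. */
    {
        PMOP * const saved_curpm = PL_curpm;
        PL_curpm = NULL;
        /* ... build/load the swash, which may run regexes ... */
        PL_curpm = saved_curpm;
    }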
|
| |
Lowercasing a Latin-1 range character results in a Latin-1 range
character, so we can use the more restrictive macros, which are
slightly more efficient than the general ones. (This difference is
only applicable on EBCDIC platforms, as the macros all expand to
nothing on ASCII ones.)
|
| |
You need to configure with g++ *and* -Accflags=-DPERL_GLOBAL_STRUCT
or -Accflags=-DPERL_GLOBAL_STRUCT_PRIVATE to see any difference.
(g++ does not do the "post-annotation" form of "unused".)
The version code has some of these issues, reported upstream.
|
| |
Removing context params will save machine code in the callers of these
functions, and 1 pointer of stack space. Some of these funcs are heavily
used, such as mg_find*. The contexts can always be re-added in the future
the same way they were removed. This patch was inspired by commit
dc3bf40570. Also remove PERL_UNUSED_CONTEXT when it's not needed. See the
removal-candidate rejection rationale in [perl #122106].
- Perl_hv_backreferences_p uses the context in S_hv_auxinit;
  commit 96a5add60f was wrong
- Perl_whichsig_sv and Perl_whichsig_pv wrongly used PERL_UNUSED_CONTEXT
  from inception in commit 84c7b88cca
- in the author's opinion the cast_* functions shouldn't be public API;
  grepping CPAN finds no usage, and they can't be static and/or
  inline-optimized since they are exported
- Perl_my_unexec: move it to the block where it is needed, and make the
  Win32 block context-free for inlining likelihood; it is private API
  with only 2 callers in core
- Perl_my_dirfd: make all blocks context-free, then change the proto
- Perl_bytes_cmp_utf8 wrongly used PERL_UNUSED_CONTEXT
  from inception in commit fed3ba5d6b
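The shape of the change, on a hypothetical function rather than one of
those listed:

    /* Before: an interpreter context that the body never uses
     *     char * Perl_example_find(pTHX_ SV *sv);
     * After: context-free, so every caller stops passing aTHX_,
     * saving machine code and one pointer of stack per call. */
    char *
    Perl_example_find(SV *sv)
    {
        return SvPVX(sv);    /* nothing here needs my_perl */
    }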
|
| |
For S_ functions, remove the context.
For Perl_ functions, add PERL_UNUSED_CONTEXT.
Tricky because sometimes this depends on DEBUGGING, and sometimes on
whether we have PERL_IMPLICIT_SYS.
(Why all the mathoms Perl_is_uni_... and Perl_is_utf8_...
functions are not being whined about is a mystery.)
vutil.c (included via util.c) has one of these, but it's cpan/,
and a known problem: https://rt.cpan.org/Ticket/Display.html?id=96100
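For the Perl_-prefixed functions, which keep the parameter for API
compatibility, the sketch is (hypothetical function again):

    int
    Perl_example_flag(pTHX_ const char *s)
    {
        PERL_UNUSED_CONTEXT;    /* pTHX must stay in the public proto;
                                   this silences the unused warning in
                                   g++ PERL_GLOBAL_STRUCT builds */
        return s != NULL && *s != '\0';
    }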
|
| |
This reverts commit 148f39b7de6eae9ddd59e0b0aff691d6abea7aca.
(Still needs more work, but wanted to see how well this passed with Jenkins.)
|
| |
Definitely not *after* it. It marks the start of the unreachable code,
not the first unreachable line. And if they are in that order,
it looks better to linebreak after the lint hint.
|
| |
- after return/croak/die/exit, return/break are pointless
  (break is not a terminator/separator, it's a goto)
- after goto, another goto (!) is pointless
- in some cases (usually function ends) introduce an explicit NOT_REACHED
  to make the noreturn nature clearer (do not do this everywhere, though,
  since that would mean adding NOT_REACHED after every croak)
- for the added NOT_REACHED also add /* NOTREACHED */ since
  NOT_REACHED is for gcc (and VC), while the comment is for linters
- declaring variables in switch blocks is just too fragile:
  it kind of works for narrowing the scope (which is nice),
  but breaks the moment there are initializations for the variables
  (the initializations will be skipped since the flow will bypass
  the start of the block); in some easy cases simply hoist the declarations
  out of the block and move them earlier
Note 1: Since after this patch the core is not yet -Wunreachable-code
clean, the warning is not enabled via cflags.SH; one needs to
-Accflags=... it.
Note 2: At least with the older gcc 4.4.7 there are far too many
"unreachable code" warnings, which seem to go away with gcc 4.8,
perhaps due to better flow control analysis. Therefore, the warning
should eventually be enabled only for modernish gccs (what about clang
and Intel cc?)
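Two of the patterns above, sketched in isolation (op codes and messages
are illustrative):

    /* The croak never returns; NOT_REACHED is for gcc/VC, the comment
     * is for the linters, and the linebreak comes after the lint hint. */
    static int
    classify(pTHX_ int op)
    {
        switch (op) {
        case 0:
            return 1;
        default:
            Perl_croak(aTHX_ "panic: unknown op %d", op);
            NOT_REACHED; /* NOTREACHED */
        }
    }

    /* And the fragile declaration-in-switch pattern: */
    static int
    fragile(int n)
    {
        switch (n) {
            int i = 0;  /* BAD: control jumps past this initialization,
                           so i is garbage in every case; hoist it out */
        case 1:
            return i;
        default:
            return -1;
        }
    }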
|
| |
This reverts commit 8c2b19724d117cecfa186d044abdbf766372c679.
I don't understand - smoke-me came back happy with three
separate reports... oh well, some other time.
|
| |
- after croak/die/exit (or return), break (or return!) are pointless
(break is not a terminator/separator, it's a promise of a jump)
- after goto, another goto (!) is pointless
- in some cases (usually function ends) introduce explicit NOT_REACHED
to make the noreturn nature clearer (do not do this everywhere, though,
since that would mean adding NOT_REACHED after every croak)
- for the added NOT_REACHED also add /* NOTREACHED */ since
NOT_REACHED is for gcc (and VC), while the comment is for linters
- declaring variables in switch blocks is just too fragile:
it kind of works for narrowing the scope (which is nice),
but breaks the moment there are initializations for the variables
(they will be skipped!); in some easy cases simply hoist the declarations
out of the block and move them earlier
There are still a few places left.
|
| |
Unlike other pod handling routines, autodoc requires the line following
an =head1 to be non-empty for its text to be included in the paragraph
started by the heading. If you fail to do this, the text will silently
be omitted from perlapi. I went through the source code, and where it
was apparent that the text was supposed to be in perlapi, deleted the
empty line so it would be included, with some revisions to make more
sense. I added =cuts where I thought it best for the text not to be
included.
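The convention, as it looks inside a C source comment scanned by
autodoc (heading and text hypothetical):

    /*
    =head1 Unicode Support
    This text is included in perlapi only because it starts on the very
    next line after the =head1; a blank line between them would silently
    drop it.

    =cut
    */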
|
| |
This entailed creating new internal functions for some of them to call
so that the functionality can be retained during the deprecation period.
|
| |
This function is now more efficiently implemented as a synonym for
isUTF8_CHAR(). We retain the Perl_is_utf8_char_buf() function for code
that calls it that way.
|
| |
This macro will inline the code to determine if a character is
well-formed UTF-8 for code points below a certain value, falling back to
a slower function for larger ones. On ASCII platforms, it will inline
for well beyond all legal Unicode code points. On EBCDIC, it currently
does so for code points up to 0x3FFF. This could be increased, but our
porting tests do the regen every time to make sure everything is ok, and
making it larger slows that down. This is worked around on ASCII by
normally commenting out the code that generates this info, but including
in utf8.h a version that did get generated; this is static information
and won't change. (This could be done for EBCDIC too, but I chose not
to at this time as each code page has a different macro generated, and
it gets ugly getting all of them into utf8.h.)
Using this macro allowed for simplification of several functions in
utf8.c.
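Conceptually, the macro has this shape; everything below is a
simplification, and both helper names are hypothetical, since the real
inline test is machine-generated by regen:

    /* Evaluates to the length in bytes of the well-formed UTF-8
     * character starting at s (bounded by e), or 0 if not well-formed. */
    #define isUTF8_CHAR(s, e)                                         \
        (is_below_generated_threshold(s)   /* hypothetical */         \
            ? generated_inline_check(s, e) /* hypothetical */         \
            : _is_utf8_char_slow(s, e))    /* slower full check */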
|
| |
This is in preparation for it being called from outside utf8.c. It is
renamed to have a leading underscore to emphasize its private nature.
|
| |
Somehow this pod stuff was orphaned from the function it describes.
|
| |
This was brought to my attention by Jarkko Hietaniemi. The compiler was
complaining that a variable could be used uninitialized. In practice
this doesn't happen, as it would only happen on bad data, and Perl
itself generates the data used. (I suppose if the data got corrupted,
it could happen.) This commit initializes the value unconditionally,
which allows a conditional setting of it to be removed.
|
| |
This automatically generates assertions for pointer arguments, etc.
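The generated macros follow the PERL_ARGS_ASSERT_<FUNCNAME> naming
convention; schematically, with a hypothetical function:

    void
    Perl_example_scan(pTHX_ const char *s)
    {
        PERL_ARGS_ASSERT_EXAMPLE_SCAN;  /* generated from embed.fnc;
                                           expands to assert(s) here */
        /* ... */
    }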
|
| |
[perl #121746]
Fix for Coverity perl5 CIDs 29069, 29070, 29071:
Unintended sign extension: ... ... if ... U8 (8 bits unsigned) ... 32
bits, signed ... 64 bits, unsigned ... is greater than 0x7FFFFFFF,
the upper bits of the result will all be 1.
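The bug class the CIDs point at, reduced to a standalone illustration:

    #include <stdint.h>

    uint64_t
    sign_extension_demo(uint8_t b)   /* b plays the role of a U8 */
    {
        /* b promotes to (signed) int before the shift; for b >= 0x80
         * the result has the sign bit set (formally undefined for a
         * signed int, which is part of the problem), and widening that
         * negative int to 64 bits sign-extends: the upper 32 bits
         * become all ones. */
        uint64_t bad  = b << 24;
        uint64_t good = (uint64_t) b << 24;  /* cast first: stays unsigned */
        return bad - good;                   /* nonzero when b >= 0x80 */
    }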
|
| |
Fix for Coverity perl5 CID 29081: Uninitialized scalar
variable (UNINIT) uninit_use_in_call: Using uninitialized value slen when
calling Perl_croak.
If all the preceding cases fail, slen hasn't been set, and croak will be
called with it.
|
| |
When converting to UTF-8, one usually doesn't need 14 bytes of available
space, which is what was previously claimed. It actually depends on the
value being converted. This change gives the precise value.
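In recent perls the precise figure is available per code point; a
sketch, assuming the UVCHR_SKIP macro (check your perl's utf8.h):

    /* Grow the buffer by exactly what this code point needs as UTF-8,
     * rather than a worst-case constant. */
    const STRLEN needed = UVCHR_SKIP(uv);
    d = (U8 *) SvGROW(sv, SvCUR(sv) + needed + 1);  /* +1 for the NUL */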
|
| |
The string input to these two functions must be NUL terminated when the
length parameter is 0.
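So a zero length is a request to scan to the NUL, along the lines of
(is_utf8_string shown as a stand-in for the two functions):

    const U8 *s = (const U8 *) "caf\xc3\xa9";   /* NUL-terminated UTF-8 */
    bool ok = is_utf8_string(s, 0);             /* len 0: runs to the NUL */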
|
| |
The previous commit added braces forming blocks. This indents the
contents of those blocks.
|
| |
mktables generates tables of Unicode properties. These are stored in
files to be loaded on demand, because the memory cost of having all of
them loaded would be excessive, and many are rarely used. Hashes created
in Heavy.pl, which is read in by utf8_heavy.pl, map each Unicode
property name to the file which contains its definition.
It turns out that nearly half of current Unicode properties are just a
single consecutive range of code points, and their definitions are
representable almost as compactly as the names of the files that contain
them.
This commit changes mktables so that the tables for single-range
properties are not written out to disk; instead a special syntax is
used in Heavy.pl to indicate this and what their definitions are.
This does not increase the memory usage of Heavy.pl appreciably, as the
definitions replace the file names that are already there, but it lowers
the number of files generated by mktables from 908 (in Unicode 6.3) to
507. These files are probably each a disk block, so the disk savings is
not large. But it means that reading in any of these properties is much
faster, as once utf8_heavy gets loaded, no further disk access is needed
to get any of these properties. Most of these properties are obscure,
but not all. The Line and Paragraph separators, for example, are quite
commonly used.
Further, utf8_heavy.pl caches the files it has read in into hashes.
This is not necessary for these, as they are already in memory, so the
total memory usage goes down if a program uses any of them; but again,
since these are small, that amount is not large. The major gain is not
having to read these files from disk at run time.
Tables that match no code points at all are also represented using this
mechanism. Previously, they were expressed as the complements of
\p{All}, which matches everything possible.
|
| |
av_tindex is a more clearly named synonym for av_len, available starting
in v5.18. This changes the core uses to it, including modules in /ext,
which are not dual-lifed.
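Usage is unchanged apart from the name; for an AV*:

    SSize_t top = av_tindex(av);  /* index of last element; -1 if empty */
    SSize_t n   = top + 1;        /* element count, when that is wanted */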
|
| |
This mostly indents and outdents based on blocks added or removed by the
previous commit. But there are a few comment changes, vertical
alignment of macro backslash continuation characters, and other
white-space changes.
|
| |
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
characters above-Latin1 that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly, if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers had to have more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not
case-insensitively match the CAP MU. This could be special-cased in
regcomp and regexec, but I wanted to avoid that. Instead the mktables
tables are set up to include the CAP MU as a character whose presence
forbids the downgrading, so the special casing is in mktables, and not
in the C code.
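At its core, the dispatch described at the top reduces to something
like this sketch (macro and variable names approximate the real ones;
treat it as illustrative):

    /* Within the scope of 'use locale', choose the rule set by the
     * global that was set when the locale was switched: */
    if (IN_LC(LC_CTYPE) && PL_in_utf8_CTYPE_locale)
        result = toLOWER_uni(c, buf, &len);   /* perl's Unicode rules */
    else
        result = toLOWER_LC(c);               /* libc locale rules */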
|
| |
The UTF8 in the name is kind of misleading, and would be more misleading
after future commits make UTF8 locales special.
|
| |
The documentation says that Perl taints certain operations when subject
to locale rules, such as lc() and ucfirst(). Prior to this commit
there were exceptions when the operand to these functions contained no
characters whose case change actually varied depending on the locale,
for example the empty string or above-Latin1 code points. Changing to
conform to the documentation simplifies the core code, and yields more
consistent results.
|
| |
This adds and modifies various comments in several files, rewrapping
some comments to occupy fewer lines but not exceed 79 columns. And
fixes some indentation and other white space issues. It includes
removing trailing white space in lines in regcomp.c. I didn't think it
was worth making a commit for each file.
|
| |
These lists are read-only. Turning on the flag may allow some
optimisations to be done, including some that may be added in the
future.
|
| |
These are the base names for various macros used in parsing identifiers.
Prior to this patch, parsing a code point above Latin1 caused disk files
to be loaded. This patch causes all the information to be compiled into
the Perl binary.
|
| |
Previous commits in this series have caused all the POSIX classes to be
completely specified at C compile time. This allows us to revise the
base function used by all these macros to use these definitions,
avoiding reading them in from disk.
|
| |
These functions will be out of the way in mathoms. There were a few
that could not be moved as-is, so I left them.
|
| |
In all these cases, there is an already existing macro that does exactly
the same thing as the code that this commit replaces. No sense
duplicating logic.
|
| |
This bottom-level function decodes the first character of a UTF-8 string
into a code point. Using it directly is discouraged. This commit cleans
up some of the warnings it can raise. Now, tests for malformations are
done before any tests for other potential issues. One of those issues
involves code points so large that they have never appeared in any
official standard (the current standard has scaled back the highest
acceptable code point from earlier versions). It is possible (though
not done in CPAN) to warn and/or forbid these code points, while
accepting smaller code points that are still above the legal Unicode
maximum. The warning message for this now includes the code point if it
is representable on the machine. Previously it always displayed raw
bytes, which is what it still does for non-representable code points.
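For reference, the call shape of this decoder, per its public
documentation (the zero flags value is just the default-checks case):

    /* s: pointer to the UTF-8 input; curlen: bytes available in it. */
    STRLEN retlen;
    UV cp = utf8n_to_uvchr(s, curlen, &retlen, 0);
    /* retlen now holds the number of bytes consumed (or how far the
     * decoder got before a malformation). */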
|